The objective of this study is to investigate the negative health effects associated with smoking, including both physical and mental health outcomes. Given the widespread prevalence of smoking and the rising use of electronic nicotine products, it is crucial to understand the extent to which these habits impact overall health. Our analysis will focus on identifying the associations between smoking and various health-related issues, as well as exploring the relationship between smoking habits and mental health indicators. Additionally, we will compare these findings with the health effects associated with the use of electronic nicotine products.
The central question we want to answer is: what are the negative health effects associated with smoking? We will look at the association between smoking and health related issues. We will also look at some mental health indicators and how they relate to smoking habits. We will also make an interesting comparison with usage of electronic nicotine products.
The negative health effects of smoking have been well-documented, yet the comprehensive impact on both physical and mental health continues to be a critical area of study. This research aims to analyze the associations between smoking and health-related issues using data from the 2021 Medical Expenditure Panel Survey Household Component, which surveyed 15,000 households in the United States. By examining physical health outcomes and mental health indicators, this study seeks to provide a detailed understanding of how smoking affects individuals. Furthermore, we will compare these effects with those associated with the use of electronic nicotine products, offering insights into the potential risks of these increasingly popular alternatives. Our findings aim to inform public health strategies and contribute to ongoing efforts to mitigate the health risks associated with smoking and electronic nicotine product usage.
The dataset we are working with is the Medical Expenditure Panel Survey Household component of 2021. It’s a survey conducted on 15000 households in the United States of America.
Here is info about characteristics: https://meps.ahrq.gov/data_files/publications/st554/stat554.shtml
Here is questionary report: https://meps.ahrq.gov/survey_comp/hc_survey/paper_quest/2021/2021_SHE_SAQ.pdf
Codebook: https://meps.ahrq.gov/data_stats/download_data/pufs/h233/h233cb.pdf
# the dataset is in the github repo, we don't need to download it every time
h233 <- read_dta("h233.dta")
head(h233)## # A tibble: 6 × 1,488
## DUID PID DUPERSID PANEL FAMID31 FAMID42 FAMID53 FAMID21 FAMIDYR
## <dbl> <dbl> <chr> <dbl+lbl> <chr> <chr> <chr> <chr> <chr>
## 1 2320005 101 2320005101 23 [23 PANEL… A A A A A
## 2 2320005 102 2320005102 23 [23 PANEL… A A A A A
## 3 2320006 101 2320006101 23 [23 PANEL… A A A A A
## 4 2320006 102 2320006102 23 [23 PANEL… B B B B B
## 5 2320006 103 2320006103 23 [23 PANEL… A A A A A
## 6 2320012 102 2320012102 23 [23 PANEL… A A A A A
## # ℹ 1,479 more variables: CPSFAMID <chr>, FCSZ1231 <dbl+lbl>,
## # FCRP1231 <dbl+lbl>, RULETR31 <chr>, RULETR42 <chr>, RULETR53 <chr>,
## # RULETR21 <chr>, RUSIZE31 <dbl+lbl>, RUSIZE42 <dbl+lbl>, RUSIZE53 <dbl+lbl>,
## # RUSIZE21 <dbl+lbl>, RUCLAS31 <dbl+lbl>, RUCLAS42 <dbl+lbl>,
## # RUCLAS53 <dbl+lbl>, RUCLAS21 <dbl+lbl>, FAMSZE31 <dbl+lbl>,
## # FAMSZE42 <dbl+lbl>, FAMSZE53 <dbl+lbl>, FAMSZE21 <dbl+lbl>,
## # FMRS1231 <dbl+lbl>, FAMS1231 <dbl+lbl>, FAMSZEYR <dbl+lbl>, …
Our main explanatory variables:
ADSMOK42: currently smokes
SDENICPROD: used electronic nicotine product
The other variables we selected are:
ADNSMK42 - If ADSMOK42 = 1, doctor advised you to quit smoking
Other variables we selected can be grouped as follows:
Demographic variables: AGELAST, REGION21, SEX, RACEV2X, MARRY21X, FTSTU21X, TTLP21X, FAMINC21, EMPST31
Mental health: ADDPRS42, ADHOPE42, ADNERV42, ADPCFL42, ADSAD42, MNHLTH31
General health: ADGENH42, ADENGY42, RTHLTH31, WLKLIM53
Disease Diagnosis: ASTHDX, ARTHDX, CALUNG, CHOLDX, CHBRON31, DIABDX_M18, EMPHDX, HIBPDX, STRKDX
vars_to_keep = c("ADSMOK42","SDENICPROD","ADDPRS42","ADGENH42",
"ADENGY42","ADHOPE42","ADNERV42","ADPCFL42","ADSAD42",
"AGELAST","ASTHDX","CALUNG","CHOLDX","DIABDX_M18","EMPHDX",
"FAMINC21","HIBPDX","MNHLTH31","RTHLTH31","TTLP21X",
"WLKLIM53","SDCOMPAN","SDLEFTOUT","SDLIFE", 'REGION21', 'RACEV2X',
'FTSTU21X', 'SEX','STRKDX','HIBPDX', 'MARRY21X','CHBRON31' ,'CANCERDX',
'ARTHDX', 'ADNSMK42', 'SDHOME', 'SDMEDCARE','SDHLTHFOOD', 'SDISOL',
'SDHMDEPR','SDHMALC','SDHMDRG', 'SDHMDIV', 'SDHMBEAT', 'SDHURTCHLD', 'EMPST31')
small_df = h233 %>% select(all_of(vars_to_keep)) %>% as_factor()
#turning the variables into factors
small_df = small_df %>% mutate(age_group = cut(AGELAST,
breaks = c(-Inf,18, 25, 35, 45, 65, Inf),
labels = c("0-18","18-24", "25-34", "35-44", "45-64", "65+"),
right = FALSE))
#adding a new variable to break into age groups
small_df %>% summary()## ADSMOK42 SDENICPROD
## -15 CANNOT BE COMPUTED: 0 -15 CANNOT BE COMPUTED: 216
## -8 DK : 0 -1 INAPPLICABLE : 9936
## -7 REFUSED : 0 1 YES : 2419
## -1 INAPPLICABLE :12914 2 NO :15765
## 1 YES : 2831
## 2 NO :12591
##
## ADDPRS42 ADGENH42
## -15 CANNOT BE COMPUTED : 225 -15 CANNOT BE COMPUTED: 482
## -1 INAPPLICABLE :12914 -1 INAPPLICABLE :12914
## 0 NOT AT ALL :11473 1 EXCELLENT : 2461
## 1 SEVERAL DAYS : 2673 2 VERY GOOD : 5284
## 2 MORE THAN HALF THE DAYS: 665 3 GOOD : 4888
## 3 NEARLY EVERY DAY : 386 4 FAIR : 1953
## 5 POOR : 354
## ADENGY42 ADHOPE42
## -1 INAPPLICABLE :12914 -15 CANNOT BE COMPUTED: 228
## 2 MOST OF THE TIME : 5565 -1 INAPPLICABLE :12914
## 3 A GOOD BIT OF THE TIME: 2923 0 NONE OF THE TIME :11623
## 4 SOME OF THE TIME : 2634 1 LITTLE OF THE TIME : 2009
## 1 ALL OF THE TIME : 1744 2 SOME OF THE TIME : 1033
## 5 LITTLE OF THE TIME : 1361 3 MOST OF THE TIME : 381
## (Other) : 1195 4 ALL OF THE TIME : 148
## ADNERV42 ADPCFL42
## -15 CANNOT BE COMPUTED: 187 -1 INAPPLICABLE :12914
## -1 INAPPLICABLE :12914 2 MOST OF THE TIME : 6314
## 0 NONE OF THE TIME : 7832 1 ALL OF THE TIME : 2639
## 1 LITTLE OF THE TIME : 4507 3 A GOOD BIT OF THE TIME: 2636
## 2 SOME OF THE TIME : 2107 4 SOME OF THE TIME : 2061
## 3 MOST OF THE TIME : 605 5 LITTLE OF THE TIME : 885
## 4 ALL OF THE TIME : 184 (Other) : 887
## ADSAD42 AGELAST ASTHDX
## -15 CANNOT BE COMPUTED: 219 Min. : 0.0 -15 CANNOT BE COMPUTED: 94
## -1 INAPPLICABLE :12914 1st Qu.:23.0 -8 DK : 20
## 0 NONE OF THE TIME :12089 Median :44.0 -7 REFUSED : 23
## 1 LITTLE OF THE TIME : 1837 Mean :43.2 -1 INAPPLICABLE : 31
## 2 SOME OF THE TIME : 855 3rd Qu.:64.0 1 YES : 4022
## 3 MOST OF THE TIME : 299 Max. :85.0 2 NO :24146
## 4 ALL OF THE TIME : 123
## CALUNG CHOLDX
## -15 CANNOT BE COMPUTED: 0 -15 CANNOT BE COMPUTED: 186
## -8 DK : 11 -8 DK : 43
## -7 REFUSED : 1 -7 REFUSED : 23
## -1 INAPPLICABLE :25428 -1 INAPPLICABLE : 5553
## 1 YES : 88 1 YES : 7925
## 2 NO : 2808 2 NO :14606
##
## DIABDX_M18 EMPHDX FAMINC21
## -15 CANNOT BE COMPUTED: 0 -15 CANNOT BE COMPUTED: 185 Min. :-88503
## -8 DK : 22 -8 DK : 18 1st Qu.: 26872
## -7 REFUSED : 23 -7 REFUSED : 23 Median : 58000
## -1 INAPPLICABLE : 125 -1 INAPPLICABLE : 5553 Mean : 81004
## 1 YES : 3265 1 YES : 483 3rd Qu.:110172
## 2 NO :24901 2 NO :22074 Max. :663883
##
## HIBPDX MNHLTH31 RTHLTH31
## -15 CANNOT BE COMPUTED: 186 1 EXCELLENT :8786 2 VERY GOOD :8831
## -8 DK : 37 2 VERY GOOD :8604 1 EXCELLENT :7801
## -7 REFUSED : 22 3 GOOD :7595 3 GOOD :7556
## -1 INAPPLICABLE : 5553 4 FAIR :2186 4 FAIR :2824
## 1 YES : 8538 -1 INAPPLICABLE: 573 5 POOR : 681
## 2 NO :14000 5 POOR : 512 -1 INAPPLICABLE: 573
## (Other) : 80 (Other) : 70
## TTLP21X WLKLIM53 SDCOMPAN
## Min. :-88503 -8 DK : 4 -1 INAPPLICABLE :9936
## 1st Qu.: 0 -7 REFUSED : 29 1 NEVER :7135
## Median : 20000 -1 INAPPLICABLE:13428 3 SOMETIMES :4724
## Mean : 34920 1 YES : 1811 2 RARELY :4596
## 3rd Qu.: 49513 2 NO :13064 4 OFTEN :1816
## Max. :390565 -15 CANNOT BE COMPUTED: 124
## (Other) : 5
## SDLEFTOUT SDLIFE
## -1 INAPPLICABLE :9936 -1 INAPPLICABLE :9936
## 1 NEVER :6635 2 VERY SATISFIED :8106
## 2 RARELY :5883 3 SOMEWHAT SATISFIED :4892
## 3 SOMETIMES :4570 1 COMPLETELY SATISFIED:4007
## 4 OFTEN :1178 4 A LITTLE SATISFIED : 881
## -15 CANNOT BE COMPUTED: 130 5 NOT AT ALL SATISFIED: 265
## (Other) : 4 (Other) : 249
## REGION21 RACEV2X
## -1 INAPPLICABLE: 233 1 WHITE - NO OTHER RACE REPORTED :21262
## 1 NORTHEAST : 4403 2 BLACK - NO OTHER RACE REPORTED : 4232
## 2 MIDWEST : 5434 12 MULTIPLE RACES REPORTED : 1059
## 3 SOUTH :10738 10 OTH ASIAN/NATV HAWAIIAN/PACFC ISL-NO OTH: 574
## 4 WEST : 7528 4 ASIAN INDIAN - NO OTHER RACE REPORTED : 409
## 5 CHINESE - NO OTHER RACE REPORTED : 304
## (Other) : 496
## FTSTU21X SEX
## -15 CANNOT BE COMPUTED: 0 -15 CANNOT BE COMPUTED: 0
## -8 DK : 0 -8 DK : 0
## -7 REFUSED : 3 -7 REFUSED : 0
## -1 INAPPLICABLE :26037 -1 INAPPLICABLE : 0
## 1 FULL-TIME STUDENT : 1091 1 MALE :13423
## 2 PART-TIME STUDENT : 137 2 FEMALE :14913
## 3 NOT A STUDENT : 1068
## STRKDX MARRY21X
## -15 CANNOT BE COMPUTED: 185 1 MARRIED :10526
## -8 DK : 19 5 NEVER MARRIED : 7087
## -7 REFUSED : 23 6 UNDER AGE 16 - INAPPLICABLE: 4844
## -1 INAPPLICABLE : 5553 3 DIVORCED : 3261
## 1 YES : 1143 2 WIDOWED : 2058
## 2 NO :21413 4 SEPARATED : 548
## (Other) : 12
## CHBRON31 CANCERDX
## -15 CANNOT BE COMPUTED: 2 -15 CANNOT BE COMPUTED: 189
## -8 DK : 67 -8 DK : 17
## -7 REFUSED : 54 -7 REFUSED : 24
## -1 INAPPLICABLE : 6144 -1 INAPPLICABLE : 5553
## 1 YES : 337 1 YES : 2908
## 2 NO :21732 2 NO :19645
##
## ARTHDX ADNSMK42
## -15 CANNOT BE COMPUTED: 185 -15 CANNOT BE COMPUTED : 38
## -8 DK : 27 -8 DK : 0
## -7 REFUSED : 24 -7 REFUSED : 0
## -1 INAPPLICABLE : 5553 -1 INAPPLICABLE :25505
## 1 YES : 6666 1 YES : 1104
## 2 NO :15881 2 NO : 1265
## 3 HAD NO VISITS IN THE LAST 12 MONTHS: 424
## SDHOME SDMEDCARE SDHLTHFOOD
## -1 INAPPLICABLE :9936 -1 INAPPLICABLE:9936 -1 INAPPLICABLE:9936
## 2 VERY SATISFIED :7330 2 VERY GOOD :6245 2 VERY GOOD :5954
## 1 COMPLETELY SATISFIED:5970 1 EXCELLENT :6152 1 EXCELLENT :5526
## 3 SOMEWHAT SATISFIED :3598 3 GOOD :3938 3 GOOD :4159
## 4 A LITTLE SATISFIED : 814 4 FAIR :1364 4 FAIR :1711
## 5 NOT AT ALL SATISFIED: 444 5 POOR : 439 5 POOR : 707
## (Other) : 244 (Other) : 262 (Other) : 343
## SDISOL SDHMDEPR
## -1 INAPPLICABLE :9936 -15 CANNOT BE COMPUTED: 226
## 1 NEVER :7206 -1 INAPPLICABLE : 9936
## 2 RARELY :4987 1 YES : 4119
## 3 SOMETIMES :4415 2 NO :14055
## 4 OFTEN :1660
## -15 CANNOT BE COMPUTED: 131
## (Other) : 1
## SDHMALC SDHMDRG
## -15 CANNOT BE COMPUTED: 226 -15 CANNOT BE COMPUTED: 221
## -1 INAPPLICABLE : 9936 -1 INAPPLICABLE : 9936
## 1 YES : 4472 1 YES : 1957
## 2 NO :13702 2 NO :16222
##
##
##
## SDHMDIV SDHMBEAT
## -15 CANNOT BE COMPUTED: 242 -15 CANNOT BE COMPUTED : 257
## -1 INAPPLICABLE : 9936 -1 INAPPLICABLE : 9936
## 1 YES : 4437 1 NEVER :14647
## 2 NO :12550 2 ONCE : 1208
## 8 PARENTS NOT MARRIED : 1171 3 MORE THAN ONCE : 2286
## 4 NEVER OR ONCE : 0
## 5 ONCE OR MORE THAN ONCE: 2
## SDHURTCHLD EMPST31
## -15 CANNOT BE COMPUTED : 246 1 EMPLOYED AT RD 3/1 INT DATE :11989
## -1 INAPPLICABLE : 9936 4 NOT EMPLOYED DURING RD 3/1 :10186
## 1 NEVER :14042 -1 INAPPLICABLE : 5456
## 2 ONCE : 1415 3 JOB DURING RD 3/1 REF PERIOD : 267
## 3 MORE THAN ONCE : 2695 -7 REFUSED : 159
## 4 NEVER OR ONCE : 2 2 JOB TO RETURN TO AT RD 3/1 INT DATE: 132
## 5 ONCE OR MORE THAN ONCE: 0 (Other) : 147
## age_group
## 0-18 :5557
## 18-24:1988
## 25-34:3306
## 35-44:3479
## 45-64:7299
## 65+ :6707
##
We start by looking at the age distribution of people that took the survey
#smokers by age group
small_df %>% select(ADSMOK42, age_group) %>% group_by(age_group) %>%
summarise(n_smokers = sum(ADSMOK42 == "1 YES"),
n_non_smokers = sum(ADSMOK42 == "2 NO"),
n_inapplicable = sum(ADSMOK42 == "-1 INAPPLICABLE")) %>%
mutate(perc_smokers = 100*n_smokers/n_non_smokers) %>%
slice(-1) %>%
ggplot(aes(x = age_group)) +
geom_col(aes(y = n_non_smokers, fill = "non smokers")) +
geom_col(aes(y = n_smokers, fill = "smokers")) +
geom_text(aes(label=paste(round(perc_smokers,2),"%"), y = 130)) +
labs(title = "Age distribution of smokers and non smokers",
y = "count", x = "age group", fill = "") + theme_classic()# age distribution of users of electronic cigarettes
small_df %>% select(SDENICPROD, age_group) %>% group_by(age_group) %>%
summarise(n_smokers = sum(SDENICPROD == "1 YES"),
n_non_smokers = sum(SDENICPROD == "2 NO"),
n_inapplicable = sum(SDENICPROD == "-1 INAPPLICABLE")) %>%
mutate(perc_smokers = 100*n_smokers/n_non_smokers) %>%
slice(-1) %>%
ggplot(aes(x = age_group)) +
geom_col(aes(y = n_non_smokers, fill = "non users")) +
geom_col(aes(y = n_smokers, fill = "users")) +
geom_text(aes(label=paste(round(perc_smokers,2),"%"), y = 130)) +
labs(title = "Age distribution of electronic nicotine product usage",
y = "count", x = "age group", fill = "") + theme_classic()Smoking rate increases with age, till the 45-64 age group. They are the most likely to smoke. However the last age group, 65 and over is the one with the lowest rate. This fact is key to understand the effect of age: as is well known, general health decreases sharply with old age, so it will be important in our analysis to always control for the confounding effect of age.
Looking at electronic nicotine usage, we see a very different picture: in the youngest group there is a really high rate, 41%. Almost one every two young adults uses those products. The rate is decreasing with age: every older generation show a large decrease from the younger one. The 65+ group, which as we have already said is the one with the worst health, sees almost no usage: just over 5%.
The two variables will allow us to make interesting comparisons since the makeup of the groups is very different.
small_df <- small_df %>%
mutate(
sex_group = case_when(
SEX == '1 MALE' ~ 'MALE',
SEX == '2 FEMALE' ~ 'FEMALE',
TRUE ~ NA_character_
)
)
smokers_df <- small_df %>% filter(ADSMOK42 == "1 YES")
non_smokers_df <- small_df %>% filter(ADSMOK42 == "2 NO")
e_smokers_df <-small_df %>% filter(SDENICPROD == "1 YES")create_plots_demogr <- function(data, demographic_variables, titles) {
data <- data %>%
mutate(
tot_inc_group = cut(TTLP21X,
breaks = c(-Inf, 10000, 15000, 20000, 25000, 30000, Inf),
labels = c("0-10000", "10001-15000", "15001-20000", "20001-25000", "25001-30000", "30000+"),
right = FALSE),
fam_inc_group = cut(FAMINC21,
breaks = c(-Inf, 10000, 15000, 20000, 25000, 30000, Inf),
labels = c("0-10000", "10001-15000", "15001-20000", "20001-25000", "25001-30000", "30000+"),
right = FALSE)
) %>%
filter(!is.na(tot_inc_group) & !is.na(fam_inc_group) & !is.na(sex_group)) %>%
filter(tot_inc_group != "NA")
create_plots <- function(data, variable, title) {
my_colors <- c("skyblue", "salmon", "lightgreen", "lightpink", "lightblue", "lightcoral")
barplot(table(data[[variable]]), main = title, xlab = variable, ylab = "Frequency", las = 2, cex.names = 0.8, col = my_colors)
}
par(mfrow = c(1, 3))
for (variable in demographic_variables) {
create_plots(data, variable, titles[[variable]])
}
}
demographic_variables <- c("age_group", "REGION21", "sex_group", "RACEV2X", "MARRY21X", "FTSTU21X", "tot_inc_group", "fam_inc_group", "EMPST31")
titles_smokers <- c(
age_group = "Age Group Distribution (Smokers)",
REGION21 = "Region Distribution (Smokers)",
sex_group = "Gender Distribution (Smokers)",
RACEV2X = "Race Distribution (Smokers)",
MARRY21X = "Marital Status Distribution (Smokers)",
FTSTU21X = "Full-Time Student Distribution (Smokers)",
tot_inc_group = "Total Income Distribution (Smokers)",
fam_inc_group = "Family Income Distribution (Smokers)",
EMPST31 = "Employment Status Distribution (Smokers)"
)
titles_non_smokers <- c(
age_group = "Age Group Distribution (Non-Smokers)",
REGION21 = "Region Distribution (Non-Smokers)",
sex_group = "Gender Distribution (Non-Smokers)",
RACEV2X = "Race Distribution (Non-Smokers)",
MARRY21X = "Marital Status Distribution (Non-Smokers)",
FTSTU21X = "Full-Time Student Distribution (Non-Smokers)",
tot_inc_group = "Total Income Distribution (Non-Smokers)",
fam_inc_group = "Family Income Distribution (Non-Smokers)",
EMPST31 = "Employment Status Distribution (Non-Smokers)"
)
create_plots_demogr(smokers_df, demographic_variables, titles_smokers)
Highlights A collective image of smokers: *
40% are 45-64 y.o. * 44% living in South region, ~22% in Midwest and
~22% in West * 49% of male and 51% of female * 70% of white race and 22%
of black * 32% never married, 31% married, just 21% divorced and 10%
widowed * 4% not a students, 1% full-time student, 0,31% part-time
students * Mean of total income is 30586, ~51% earn 30000+, ~25% earn
less than 10000 * Mean of total income of family is 50665, 55% earn
30000+, other family’s income distributed gradually in income range *
50% are not employed during RD, 46% employed
A collective image of non-smokers: * 36% are 65+ y.o., 31% are 45-64 y.o. * 37% living in South region, 22% in Midwest and ~26% in West * 56% of male and 44% of female * 79% of white race and 11% of black * 24% never married, 50% married, just 14% divorced and 10% widowed * 4% not a students, 2,6 % full-time student, 0,04% part-time students * Mean of total income is 47740, ~55% earn 30000+ * Mean of total income of family is 85484 , 44% earn 30000+,other family’s income distributed gradually in income range * 45% are not employed during RD, 53% employed
Did smokers receive a doctor’s recommendation to quit?
ggplot(smokers_df, aes(x = ADNSMK42)) +
geom_bar(fill = "skyblue") +
labs(title = "Distribution of Smokers by Doctor's Recommendation", x = "Doctor's Recommendation", y = "Frequency") + theme_classic()
38% of current smokers still smokes after doctor’s advice to quit
smoke
Smokers that also make use of electronic nicotine products
ggplot(smokers_df, aes(x = SDENICPROD == "1 YES")) +
geom_bar(fill = "lightcoral") +
labs(title = "Distribution of e-Smokers through current smokers", x = "E-smokers No/Yes", y = "Count") + theme_classic()
Current smokers make use of e-cigarettes at only 21% rate
This is the main reason why we looked at the two explanatory variables separately: the group that uses both is really small with only 700 people in a sample of 28,000 people
#mental health
create_plots_mental_health <- function(data, mental_health_variables, titles) {
data <- data %>%
filter(!is.na(sex_group))
create_plots <- function(data, variable, title) {
my_colors <- c("skyblue", "salmon", "lightgreen", "lightpink", "lightblue", "lightcoral")
barplot(table(data[[variable]]), main = title, xlab = variable, ylab = "Frequency", las = 2, cex.names = 0.8, col = my_colors)
}
par(mfrow = c(1, 3))
for (variable in mental_health_variables) {
create_plots(data, variable, titles[[variable]])
}
}
mental_health_variables <- c("ADDPRS42", "ADHOPE42", "ADNERV42", "ADPCFL42", "ADSAD42", "MNHLTH31")
titles_smokers_mental_health <- c(
ADDPRS42 = "Perceived Stress (Smokers)",
ADHOPE42 = "Hopefulness (Smokers)",
ADNERV42 = "Nervousness (Smokers)",
ADPCFL42 = "Psychological Fatigue (Smokers)",
ADSAD42 = "Sadness (Smokers)",
MNHLTH31 = "Mental Health Status (Smokers)"
)
titles_non_smokers_mental_health <- c(
ADDPRS42 = "Perceived Stress (Non-Smokers)",
ADHOPE42 = "Hopefulness (Non-Smokers)",
ADNERV42 = "Nervousness (Non-Smokers)",
ADPCFL42 = "Psychological Fatigue (Non-Smokers)",
ADSAD42 = "Sadness (Non-Smokers)",
MNHLTH31 = "Mental Health Status (Non-Smokers)"
)
titles_e_smokers_mental_health <- c(
ADDPRS42 = "Perceived Stress (E-Smokers)",
ADHOPE42 = "Hopefulness (E-Smokers)",
ADNERV42 = "Nervousness (E-Smokers)",
ADPCFL42 = "Psychological Fatigue (E-Smokers)",
ADSAD42 = "Sadness (E-Smokers)",
MNHLTH31 = "Mental Health Status (E-Smokers)"
)
create_plots_mental_health(smokers_df, mental_health_variables, titles_smokers_mental_health)create_plots_mental_health(non_smokers_df, mental_health_variables, titles_non_smokers_mental_health)
Highlights
*In comparison, non-smokers mostly don’t feel perceived stress, hopefulness and sadness. However, that mental health status is better: 40% have very good, 28% excellent and 28% good, just 7% fair.
*Another picture for e-smokers: mostly of them don’t have perceived stress and hopefulness, but 24% fill little of time nervous and 15% some of the time. 27% little of the time have psychological fatigue, ~16% have it most of the time. Overall, 14% feel fair, mental health is more than less good or excellent.
create_plots_general_health <- function(data, general_health_variables, titles) {
data <- data %>%
filter(!is.na(sex_group))
create_plots <- function(data, variable, title) {
my_colors <- c("skyblue", "salmon", "lightgreen", "lightpink", "lightblue", "lightcoral")
barplot(table(data[[variable]]), main = title, xlab = variable, ylab = "Frequency", las = 2, cex.names = 0.8, col = my_colors)
}
par(mfrow = c(1, 2))
for (variable in general_health_variables) {
create_plots(data, variable, titles[[variable]])
}
}
general_health_variables <- c("ADGENH42", "ADENGY42", "RTHLTH31", "WLKLIM53")
titles_smokers_general_health <- c(
ADGENH42 = "General Health Perception (Smokers)",
ADENGY42 = "Energy Level (Smokers)",
RTHLTH31 = "Overall Health Rating (Smokers)",
WLKLIM53 = "Walking Limitation (Smokers)"
)
titles_non_smokers_general_health <- c(
ADGENH42 = "General Health Perception (Non-Smokers)",
ADENGY42 = "Energy Level (Non-Smokers)",
RTHLTH31 = "Overall Health Rating (Non-Smokers)",
WLKLIM53 = "Walking Limitation (Non-Smokers)"
)
titles_e_smokers_general_health <- c(
ADGENH42 = "General Health Perception (E-Smokers)",
ADENGY42 = "Energy Level (E-Smokers)",
RTHLTH31 = "Overall Health Rating (E-Smokers)",
WLKLIM53 = "Walking Limitation (E-Smokers)"
)
create_plots_general_health(smokers_df, general_health_variables, titles_smokers_general_health)create_plots_general_health(non_smokers_df, general_health_variables, titles_non_smokers_general_health)create_plots_general_health(e_smokers_df, general_health_variables, titles_e_smokers_general_health)
Highlights:
*Little another picture for e-smokers.Their energy level some times lower and distribution on overall health rate is higher tan for smokers and non-smokers. 45% don’t have a walking limitations
#deseases
create_plots_health_disease <- function(data, health_disease_variables, titles) {
data <- data %>%
filter(!is.na(sex_group))
create_plots <- function(data, variable, title) {
my_colors <- c("skyblue", "salmon", "lightgreen", "lightpink", "lightblue", "lightcoral")
vector_to_plot = data %>% filter(.data[[variable]] %in% c("1 YES", "2 NO")) %>%
pull(variable) %>% factor()
barplot(table(vector_to_plot), main = title, xlab = variable, ylab = "Frequency", las = 2, cex.names = 0.8, col = my_colors)
}
par(mfrow = c(1, 3))
for (variable in health_disease_variables) {
create_plots(data, variable, titles[[variable]])
}
}
health_disease_variables <- c("ASTHDX", "ARTHDX", "CALUNG", "CHOLDX", "CHBRON31", "DIABDX_M18", "EMPHDX", "HIBPDX", "STRKDX")
titles_smokers_health_disease <- c(
ASTHDX = "Asthma Diagnosis (Smokers)",
ARTHDX = "Arthritis Diagnosis (Smokers)",
CALUNG = "Lung Cancer Diagnosis (Smokers)",
CHOLDX = "Cholesterol Diagnosis (Smokers)",
CHBRON31 = "Chronic Bronchitis Diagnosis (Smokers)",
DIABDX_M18 = "Diabetes Diagnosis (Smokers)",
EMPHDX = "Emphysema Diagnosis (Smokers)",
HIBPDX = "High Blood Pressure Diagnosis (Smokers)",
OHRTDX = "Heart Disease Diagnosis (Smokers)",
OHRTTYPE = "Type of Heart Disease Diagnosis (Smokers)",
STRKDX = "Stroke Diagnosis (Smokers)"
)
titles_non_smokers_health_disease <- c(
ASTHDX = "Asthma Diagnosis (Non-Smokers)",
ARTHDX = "Arthritis Diagnosis (Non-Smokers)",
CALUNG = "Lung Cancer Diagnosis (Non-Smokers)",
CHOLDX = "Cholesterol Diagnosis (Non-Smokers)",
CHBRON31 = "Chronic Bronchitis Diagnosis (Non-Smokers)",
DIABDX_M18 = "Diabetes Diagnosis (Non-Smokers)",
EMPHDX = "Emphysema Diagnosis (Non-Smokers)",
HIBPDX = "High Blood Pressure Diagnosis (Non-Smokers)",
OHRTDX = "Heart Disease Diagnosis (Non-Smokers)",
OHRTTYPE = "Type of Heart Disease Diagnosis (Non-Smokers)",
STRKDX = "Stroke Diagnosis (Non-Smokers)"
)
create_plots_health_disease(smokers_df, health_disease_variables, titles_smokers_health_disease)create_plots_health_disease(non_smokers_df, health_disease_variables, titles_non_smokers_health_disease)We immediately see that some diseases are very rare, while others are quite common.
condition_labels <- c(
"Asthma Diagnosis",
"Arthritis Diagnosis",
"Lung Cancer Diagnosis",
"Cholesterol Diagnosis",
"Chronic Bronchitis Diagnosis",
"Diabetes Diagnosis",
"Emphysema Diagnosis",
"High Blood Pressure Diagnosis",
"Stroke Diagnosis"
)
diseases_percentages = small_df %>% filter(AGELAST >= 18) %>%
select(ADSMOK42, all_of(health_disease_variables)) %>% group_by(ADSMOK42) %>%
summarise(ASTHDX = sum(ASTHDX == "1 YES")/sum(ASTHDX %in% c("1 YES","2 NO")),
ARTHDX = sum(ARTHDX == "1 YES")/sum(ARTHDX %in% c("1 YES","2 NO")),
CALUNG = sum(CALUNG == "1 YES")/sum(CALUNG %in% c("1 YES","2 NO")),
CHOLDX = sum(CHOLDX == "1 YES")/sum(CHOLDX %in% c("1 YES","2 NO")),
CHBRON31 = sum(CHBRON31 == "1 YES")/sum(CHBRON31 %in% c("1 YES","2 NO")),
DIABDX_M18 = sum(DIABDX_M18 == "1 YES")/sum(DIABDX_M18 %in% c("1 YES","2 NO")),
EMPHDX = sum(EMPHDX == "1 YES")/sum(EMPHDX %in% c("1 YES","2 NO")),
HIBPDX = sum(HIBPDX == "1 YES")/sum(HIBPDX %in% c("1 YES","2 NO")),
STRKDX = sum(STRKDX == "1 YES")/sum(STRKDX %in% c("1 YES","2 NO"))) %>%
pivot_longer(cols = health_disease_variables) %>%
rename(percentage = value , condition = name) %>%
mutate(condition = factor(condition,
levels = health_disease_variables,
labels = condition_labels))## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(health_disease_variables)
##
## # Now:
## data %>% select(all_of(health_disease_variables))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
diseases_percentages %>% filter(ADSMOK42 == "1 YES") %>%
ggplot(aes(x = condition, y = percentage*100, fill = condition)) +
geom_col() +
coord_flip() +
labs(
title = "Percentage of Smokers 18 or Older \nWho Report Specific Health Diseases",
x = "",
y = "Percentage"
) +
theme_classic() +
theme(legend.position = "none")diseases_percentages %>% filter(ADSMOK42 == "2 NO") %>%
ggplot(aes(x = condition, y = percentage*100, fill = condition)) +
geom_col() +
coord_flip() +
labs(
title = "Percentage of non-Smokers 18 or Older \nWho Report Specific Health Diseases",
x = "",
y = "Percentage"
) +
theme_classic() +
theme(legend.position = "none")diseases_percentages %>% filter(ADSMOK42 != "-1 INAPPLICABLE") %>%
mutate(ADSMOK42 = factor(ADSMOK42, levels = c("1 YES","2 NO"),
labels = c("Smokers","Non-Smokers"))) %>%
mutate(ADSMOK42 = relevel(ADSMOK42, "Non-Smokers")) %>%
ggplot(aes(x = condition, y = percentage*100, fill = ADSMOK42)) +
geom_col(position = position_dodge()) +
coord_flip() +
labs(
title = "Percentage of people 18 or Older Who Report Specific Diseases",
x = "",
y = "Percentage"
) +
theme(legend.title = element_blank())# recreate the plot above but with electronic nicotine
diseases_percentages = small_df %>% filter(AGELAST >= 18) %>%
select(SDENICPROD, all_of(health_disease_variables)) %>% group_by(SDENICPROD) %>%
summarise(ASTHDX = sum(ASTHDX == "1 YES")/sum(ASTHDX %in% c("1 YES","2 NO")),
ARTHDX = sum(ARTHDX == "1 YES")/sum(ARTHDX %in% c("1 YES","2 NO")),
CALUNG = sum(CALUNG == "1 YES")/sum(CALUNG %in% c("1 YES","2 NO")),
CHOLDX = sum(CHOLDX == "1 YES")/sum(CHOLDX %in% c("1 YES","2 NO")),
CHBRON31 = sum(CHBRON31 == "1 YES")/sum(CHBRON31 %in% c("1 YES","2 NO")),
DIABDX_M18 = sum(DIABDX_M18 == "1 YES")/sum(DIABDX_M18 %in% c("1 YES","2 NO")),
EMPHDX = sum(EMPHDX == "1 YES")/sum(EMPHDX %in% c("1 YES","2 NO")),
HIBPDX = sum(HIBPDX == "1 YES")/sum(HIBPDX %in% c("1 YES","2 NO")),
STRKDX = sum(STRKDX == "1 YES")/sum(STRKDX %in% c("1 YES","2 NO"))) %>%
pivot_longer(cols = health_disease_variables) %>%
rename(percentage = value , condition = name) %>%
mutate(condition = factor(condition,
levels = health_disease_variables,
labels = condition_labels))
diseases_percentages %>% filter(SDENICPROD %in% c("1 YES","2 NO")) %>%
mutate(SDENICPROD = factor(SDENICPROD, levels = c("1 YES","2 NO"),
labels = c("Users of E-nic. Prods","Non-Users"))) %>%
mutate(SDENICPROD = relevel(SDENICPROD, "Non-Users")) %>%
ggplot(aes(x = condition, y = percentage*100, fill = SDENICPROD)) +
geom_col(position = position_dodge()) +
coord_flip() +
labs(
title = "Disease Rates among Users of Electronic Nicotine Products",
x = "",
y = "Percentage"
) +
theme(legend.title = element_blank())# contingency tables and chisquared test for
# ADSMOK42
#starting with ADSMOK42 and physical health
titles_health_disease <- c(
ASTHDX = "Asthma Diagnosis",
ARTHDX = "Arthritis Diagnosis",
CALUNG = "Lung Cancer Diagnosis",
CHOLDX = "Cholesterol Diagnosis",
CHBRON31 = "Chronic Bronchitis Diagnosis",
DIABDX_M18 = "Diabetes Diagnosis",
EMPHDX = "Emphysema Diagnosis",
HIBPDX = "High Blood Pressure Diagnosis",
STRKDX = "Stroke Diagnosis"
)
for(variable in health_disease_variables){
if("1 YES" %in% levels(small_df[[variable]])){
print(titles_health_disease[variable])
print("Contingency table")
print(cont_tables(small_df, "ADSMOK42", variable)%>% round(digits=3))
print("testing hypothesis of no association")
print(cont_tables(small_df, "ADSMOK42", variable,F)%>% chisq.test(correct = F))
print("------------")
}
}## ASTHDX
## "Asthma Diagnosis"
## [1] "Contingency table"
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(explanatory_var)
##
## # Now:
## data %>% select(all_of(explanatory_var))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(var_of_interest)
##
## # Now:
## data %>% select(all_of(var_of_interest))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## ASTHDX
## ADSMOK42 1 YES 2 NO
## 1 YES 0.176 0.824
## 2 NO 0.146 0.854
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 15.62, df = 1, p-value = 7.743e-05
##
## [1] "------------"
## ARTHDX
## "Arthritis Diagnosis"
## [1] "Contingency table"
## ARTHDX
## ADSMOK42 1 YES 2 NO
## 1 YES 0.355 0.645
## 2 NO 0.329 0.671
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 6.784, df = 1, p-value = 0.009198
##
## [1] "------------"
## CALUNG
## "Lung Cancer Diagnosis"
## [1] "Contingency table"
## CALUNG
## ADSMOK42 1 YES 2 NO
## 1 YES 0.044 0.956
## 2 NO 0.025 0.975
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 3.6873, df = 1, p-value = 0.05483
##
## [1] "------------"
## CHOLDX
## "Cholesterol Diagnosis"
## [1] "Contingency table"
## CHOLDX
## ADSMOK42 1 YES 2 NO
## 1 YES 0.406 0.594
## 2 NO 0.389 0.611
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 2.6921, df = 1, p-value = 0.1008
##
## [1] "------------"
## CHBRON31
## "Chronic Bronchitis Diagnosis"
## [1] "Contingency table"
## CHBRON31
## ADSMOK42 1 YES 2 NO
## 1 YES 0.027 0.973
## 2 NO 0.014 0.986
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 27.302, df = 1, p-value = 1.74e-07
##
## [1] "------------"
## DIABDX_M18
## "Diabetes Diagnosis"
## [1] "Contingency table"
## DIABDX_M18
## ADSMOK42 1 YES 2 NO
## 1 YES 0.167 0.833
## 2 NO 0.151 0.849
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 4.4548, df = 1, p-value = 0.0348
##
## [1] "------------"
## EMPHDX
## "Emphysema Diagnosis"
## [1] "Contingency table"
## EMPHDX
## ADSMOK42 1 YES 2 NO
## 1 YES 0.053 0.947
## 2 NO 0.017 0.983
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 135.88, df = 1, p-value < 2.2e-16
##
## [1] "------------"
## HIBPDX
## "High Blood Pressure Diagnosis"
## [1] "Contingency table"
## HIBPDX
## ADSMOK42 1 YES 2 NO
## 1 YES 0.446 0.554
## 2 NO 0.408 0.592
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 13.933, df = 1, p-value = 0.0001894
##
## [1] "------------"
## STRKDX
## "Stroke Diagnosis"
## [1] "Contingency table"
## STRKDX
## ADSMOK42 1 YES 2 NO
## 1 YES 0.076 0.924
## 2 NO 0.050 0.950
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 29.652, df = 1, p-value = 5.169e-08
##
## [1] "------------"
Summary of findings from contingency tables:
Asthma diagnosis: the association is statistically significant
Arthritis: also statistically significant
Lung cancer: p-value is just above 5%, the significance is weaker
Cholesterol Diagnosis: Smokers do have higher rates than non-smokers, but it’s not statistically significant
Chronic Bronchitis: the association is statistically significant
Diabetes diagnosis: also statistically significant
Emphysema diagnosis: also statistically significant
High blood pressure: also statistically significant
Stroke diagnosis: also statistically significant
All the findings are promising: the contingency tables show strong association between smoking and various diseases. However this is not enough to draw conclusions: we need to control for other variables. We know that the percentage of smokers is different for different age brackets, now we will repeat the test but only selecting one age bracket at the time
chisq_df_diseases = data.frame(diseases = health_disease_variables,
p_val_full_data= rep(NA, length(health_disease_variables)),
p_val_65plus_data= rep(NA, length(health_disease_variables)),
p_val_45_64_data= rep(NA, length(health_disease_variables)))
for(i in 1:nrow(chisq_df_diseases)){
variable = chisq_df_diseases[i,"diseases"]
test_full = cont_tables(small_df,"ADSMOK42",variable, F) %>%
chisq.test(correct = F)
test_45_64 = cont_tables_age(small_df,"ADSMOK42",variable,"45-64",F) %>%
chisq.test(correct = F)
test_65plus = cont_tables_age(small_df,"ADSMOK42",variable,"65+",F) %>%
chisq.test(correct = F)
chisq_df_diseases[i,"p_val_full_data"] = test_full$p.val
chisq_df_diseases[i,"p_val_45_64_data"] = test_45_64$p.val
chisq_df_diseases[i,"p_val_65plus_data"] = test_65plus$p.val
}## Warning in chisq.test(., correct = F): Chi-squared approximation may be
## incorrect
chisq_df_diseases = chisq_df_diseases %>%
mutate(diseases = titles_health_disease[diseases]) %>%
pivot_longer(cols=c("p_val_full_data","p_val_65plus_data","p_val_45_64_data"))
ggplot(chisq_df_diseases, aes(x = name, y = value)) +
geom_col() + geom_hline(aes(yintercept = 0.05), col = "red") +
facet_wrap(~diseases, scales = "free") +
labs(title="Testing hypothesis of no association between Smoking and various diseases",
subtitle = "P-value of Chi squared test, controlling for age groups",
x = "", y ="")+
scale_x_discrete(labels = c("p_val_full_data" = "All patients",
"p_val_65plus_data" = "65 plus",
"p_val_45_64_data" = "45-64")) +
theme_minimal() + theme(axis.text.x = element_text(size = 7))We can also look at the odds ratio.
or_data_all = expand.grid(diseases = health_disease_variables,
age = "all",estimate = NA,
lower = NA, upper = NA)
for(i in 1:nrow(or_data_all)){
var_of_interest = as.character(or_data_all$diseases[i])
or = small_df %>%
select(ADSMOK42,var_of_interest)%>%
filter(ADSMOK42 %in% c("1 YES", "2 NO"),
.data[[var_of_interest]] %in% c("1 YES", "2 NO")) %>%
mutate(across(c(1,2), factor)) %>%
table() %>% epitools::oddsratio(method = "wald", correction = F)
or_data_all[i,"estimate"] = or$measure[2,"estimate"]
or_data_all[i,"lower"] = or$measure[2,"lower"]
or_data_all[i,"upper"] = or$measure[2,"upper"]
}
or_data_45_64 = expand.grid(diseases = health_disease_variables,
age = "45_64",estimate = NA,
lower = NA, upper = NA)
for(i in 1:nrow(or_data_45_64)){
var_of_interest = as.character(or_data_45_64$diseases[i])
or = small_df %>% filter(age_group == "45-64") %>%
select(ADSMOK42,var_of_interest)%>%
filter(ADSMOK42 %in% c("1 YES", "2 NO"),
.data[[var_of_interest]] %in% c("1 YES", "2 NO")) %>%
mutate(across(c(1,2), factor)) %>%
table() %>% epitools::oddsratio(method = "wald", correction = F)
or_data_45_64[i,"estimate"] = or$measure[2,"estimate"]
or_data_45_64[i,"lower"] = or$measure[2,"lower"]
or_data_45_64[i,"upper"] = or$measure[2,"upper"]
}## Warning in chisq.test(xx, correct = correction): Chi-squared approximation may
## be incorrect
or_data_65plus = expand.grid(diseases = health_disease_variables,
age = "65plus",estimate = NA,
lower = NA, upper = NA)
for(i in 1:nrow(or_data_65plus)){
var_of_interest = as.character(or_data_65plus$diseases[i])
or = small_df %>% filter(age_group == "65+") %>%
select(ADSMOK42,var_of_interest)%>%
filter(ADSMOK42 %in% c("1 YES", "2 NO"),
.data[[var_of_interest]] %in% c("1 YES", "2 NO")) %>%
mutate(across(c(1,2), factor)) %>%
table() %>% epitools::oddsratio(method = "wald", correction = F)
or_data_65plus[i,"estimate"] = or$measure[2,"estimate"]
or_data_65plus[i,"lower"] = or$measure[2,"lower"]
or_data_65plus[i,"upper"] = or$measure[2,"upper"]
}
or_data = rbind(or_data_all,or_data_45_64,or_data_65plus)
or_data %>% mutate(diseases = titles_health_disease[diseases]) %>%
ggplot(aes(x = age)) + geom_point(aes(y = estimate)) +
geom_segment(aes(x = age, xend = age, y = lower, yend = upper))+
geom_hline(aes(yintercept = 1), col = "red")+
facet_wrap(~diseases, scales = "free") +
labs(title="Odds ratio of disease diagnosis for Smokers and Non-Smokers",
subtitle = "95% confidence intervals for different age groups",
x = "", y ="")+
scale_x_discrete(labels = c("all" = "All",
"65plus" = "65 and over" ,
"45_64" = "45-64")) +
theme_minimal() + theme(axis.text.x = element_text(size = 7))#writing functions to make contingency tables for multiple categories
table_multi_cat = function(data, explanatory_var, var_of_interest, cond_prob = T){
tbl = data %>%
select(explanatory_var, var_of_interest)%>%
filter(.data[[explanatory_var]] %in% c("1 YES", "2 NO"),
!.data[[var_of_interest]] %in% c("-15 CANNOT BE COMPUTED",
"-1 INAPPLICABLE",
"-8 DK","-7 REFUSED"))%>%
mutate(across(c(1,2), factor)) %>%
table()
if (cond_prob){
tbl = tbl %>% prop.table(margin = 1)
}
tbl
}
#this one takes two variables and an age group, allowing to check separately
#different ages
table_multi_cat_age = function(data, explanatory_var, var_of_interest,
age_bracket, cond_prob = T){
tbl = data %>% filter(age_group == age_bracket) %>%
select(explanatory_var,var_of_interest) %>%
filter(.data[[explanatory_var]] %in% c("1 YES", "2 NO"),
!.data[[var_of_interest]] %in% c("-15 CANNOT BE COMPUTED",
"-1 INAPPLICABLE",
"-8 DK","-7 REFUSED"))%>%
mutate(across(c(1,2), factor)) %>%
table()
if(cond_prob){
tbl = tbl %>% prop.table(margin = 1)
}
tbl
}
#plotting all the mental health variables
# "MNHLTH31"
table_multi_cat(small_df, "ADSMOK42", "MNHLTH31") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = MNHLTH31)) +
labs(title = "Contingency table: Perceived Mental Health status", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "Smokers",
"2 NO" = "Non Smokers"))#"ADPCFL42"
table_multi_cat(small_df, "ADSMOK42", "ADPCFL42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADPCFL42)) +
labs(subtitle = "During the past 4 weeks, how often people felt calm and peaceful",
title = "Contingency table: Smoking and feeling peaceful", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "Smokers",
"2 NO" = "Non Smokers"))#"ADDPRS42"
table_multi_cat(small_df, "ADSMOK42", "ADDPRS42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADDPRS42)) +
labs(subtitle = "How many days in the past 2 weeks people felt down/depressed/hopeless",
title = "Contingency table: Smoking and feeling down", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "Smokers",
"2 NO" = "Non Smokers"))# "ADHOPE42"
table_multi_cat(small_df, "ADSMOK42", "ADHOPE42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADHOPE42)) +
labs(subtitle = "During the past 30 days, how often people felt hopeless",
title = "Contingency table: Smoking and feeling hopeless", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "Smokers",
"2 NO" = "Non Smokers"))# "ADNERV42"
table_multi_cat(small_df, "ADSMOK42", "ADNERV42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADNERV42)) +
labs(subtitle = "During the past 30 days, how often people felt nervous",
title = "Contingency table: Smoking and feeling nervous", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "Smokers",
"2 NO" = "Non Smokers"))# "ADSAD42"
table_multi_cat(small_df, "ADSMOK42", "ADSAD42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADSAD42)) +
labs(subtitle = "During the past 30 days, how often people felt so sad that nothing could cheer the person up",
title = "Contingency table: Smoking and feeling sad", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "Smokers",
"2 NO" = "Non Smokers"))# looking at how the contingency table for perceived mental health and smoking
# is different in the different age groups
age_18_24 = table_multi_cat_age(small_df, "ADSMOK42", "MNHLTH31", "18-24") %>% as_tibble %>%
mutate(age = "18-24")
age_25_34 = table_multi_cat_age(small_df, "ADSMOK42", "MNHLTH31", "25-34") %>% as_tibble %>%
mutate(age = "25-34")
age_35_44 = table_multi_cat_age(small_df, "ADSMOK42", "MNHLTH31", "35-44") %>% as_tibble %>%
mutate(age = "35-44")
age_45_64 = table_multi_cat_age(small_df, "ADSMOK42", "MNHLTH31", "45-64") %>% as_tibble %>%
mutate(age = "45-64")
age_65plus = table_multi_cat_age(small_df, "ADSMOK42", "MNHLTH31", "65+") %>% as_tibble %>%
mutate(age = "65+")
all_tables = rbind(age_18_24,age_25_34,age_35_44,age_45_64,age_65plus)
all_tables %>% mutate(age = factor(age)) %>%
ggplot(aes(x = n, fill = MNHLTH31, y = ADSMOK42)) +
geom_col(orientation = "y") +
facet_grid(rows = vars(age))+
labs(title = "Perceived Mental Health and smoking in different age groups", y = "",
subtitle = "Contingency tables with conditional probability", x="frequency", fill="Perceived Mental Health") +
scale_y_discrete(labels = c("1 YES" = "Smokers",
"2 NO" = "Non Smokers"))+
theme(strip.text.y = element_text(angle = 0))# looking at how the contingency table for "ADDPRS42" and smoking
# is different in the different age groups
age_18_24 = table_multi_cat_age(small_df, "ADSMOK42", "ADDPRS42", "18-24") %>% as_tibble %>%
mutate(age = "18-24")
age_25_34 = table_multi_cat_age(small_df, "ADSMOK42", "ADDPRS42", "25-34") %>% as_tibble %>%
mutate(age = "25-34")
age_35_44 = table_multi_cat_age(small_df, "ADSMOK42", "ADDPRS42", "35-44") %>% as_tibble %>%
mutate(age = "35-44")
age_45_64 = table_multi_cat_age(small_df, "ADSMOK42", "ADDPRS42", "45-64") %>% as_tibble %>%
mutate(age = "45-64")
age_65plus = table_multi_cat_age(small_df, "ADSMOK42", "ADDPRS42", "65+") %>% as_tibble %>%
mutate(age = "65+")
all_tables = rbind(age_18_24,age_25_34,age_35_44,age_45_64,age_65plus)
all_tables %>% mutate(age = factor(age)) %>%
ggplot(aes(x = n, fill = ADDPRS42, y = ADSMOK42)) +
geom_col(orientation = "y") +
facet_grid(rows = vars(age))+
labs(subtitle = "How many days in the past 2 weeks people felt down/depressed/hopeless",
y = "", fill="How many days people felt down",
title = "Contingency table: Smoking and feeling down", x="frequency", ) +
scale_y_discrete(labels = c("1 YES" = "Smokers",
"2 NO" = "Non Smokers"))+
theme(strip.text.y = element_text(angle = 0))In general, we can say that the mental health of non-smokers is better across all indicators, and age doesn’t seem to have any impact on this. This is expected, because old age is not associated with worse mental health as strongly as it is associated with physical and general health.
#ADGENH42
table_multi_cat(small_df, "ADSMOK42", "ADGENH42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADGENH42)) +
labs(title = "Contingency table: Smoking and General health", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "Smokers",
"2 NO" = "Non Smokers"))#RTHLTH31
table_multi_cat(small_df, "ADSMOK42", "RTHLTH31") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = RTHLTH31)) +
labs(title = "Contingency table: Smoking and Perceived Health Status", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "Smokers",
"2 NO" = "Non Smokers"))#ADENGY42
table_multi_cat(small_df, "ADSMOK42", "ADENGY42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = ADENGY42)) +
labs(subtitle = "During the past 4 weeks, how many people had a lot of energy",
title = "Contingency table: Smoking and energy", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "Smokers",
"2 NO" = "Non Smokers"))#WLKLIM53
table_multi_cat(small_df, "ADSMOK42", "WLKLIM53") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = ADSMOK42, y = n, fill = WLKLIM53)) +
labs(title = "Contingency table: Smoking and Limitation in Physical Functioning", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "Smokers",
"2 NO" = "Non Smokers"))We see what we were expecting: the general health of smokers is worse. Non-Smokers are healthier, feel healthier and have more energy. We also see that among smokers there is a much higher rate of limitation in physical functioning. It’s hard to look for a direct causality between the two: there are many interactions between different demographic factors. We can explore this deeper, we will look at this association in different age groups.
# odds ratio: smoking and walking limitations in different age groups
table_multi_cat_control = function(data, explanatory_var, var_of_interest, cond_prob = T){
tbl = data %>%
select(explanatory_var, var_of_interest,age_group)%>%
filter(.data[[explanatory_var]] %in% c("1 YES", "2 NO"),
!.data[[var_of_interest]] %in% c("-15 CANNOT BE COMPUTED",
"-1 INAPPLICABLE",
"-8 DK","-7 REFUSED"))%>%
mutate(across(c(1,2), factor)) %>%
table()
if (cond_prob){
tbl = tbl %>% prop.table(margin = 1)
}
tbl
}
table_multi_cat_control(small_df, "ADSMOK42", "WLKLIM53",F) %>% as_tibble() %>%
filter(age_group != "0-18") %>%
group_by(age_group) %>%
summarise(or_estimate = epitools::oddsratio(n,method = "wald",
correction = F)$measure[2,"estimate"],
or_lower = epitools::oddsratio(n,method = "wald",
correction = F)$measure[2,"lower"],
or_upper = epitools::oddsratio(n,method = "wald",
correction = F)$measure[2,"upper"]) %>%
mutate(age_group = factor(age_group)) %>%
ggplot() +
geom_point(aes(x = age_group, y = or_estimate)) +
geom_segment(aes(x = age_group,xend=age_group,
y = or_lower, yend=or_upper))+
geom_hline(aes(yintercept =1), col = "red") +
labs(title = "Smoking and Limitation in Physical Functioning across different ages",
subtitle = "95% confidence intervals for odds ratio",
x = "Age Group",
y="odds ratio")## Warning: There were 3 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `or_estimate = ...[]`.
## ℹ In group 1: `age_group = "18-24"`.
## Caused by warning in `chisq.test()`:
## ! Chi-squared approximation may be incorrect
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
This is an interesting graph: young people (under 35) tend to have good general health indicators regardless of their smoking habits. This changes as people start to age past that threshold. The next two age groups show a larger association between smoking and worse health than the oldest group. Is this suggesting that smoking makes the effects of aging start sooner? However once people reach old age, their health gets worse even if they don’t smoke.
# diseases association with electronic nicotine product
for(variable in health_disease_variables){
if("1 YES" %in% levels(small_df[[variable]])){
print(titles_health_disease[variable])
print("Contingency table")
print(cont_tables(small_df, "SDENICPROD", variable)%>% round(digits=3))
print("testing hypothesis of no association")
print(cont_tables(small_df, "SDENICPROD", variable,F)%>% chisq.test(correct = F))
print("------------")
}
}## ASTHDX
## "Asthma Diagnosis"
## [1] "Contingency table"
## ASTHDX
## SDENICPROD 1 YES 2 NO
## 1 YES 0.201 0.799
## 2 NO 0.144 0.856
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 52.75, df = 1, p-value = 3.788e-13
##
## [1] "------------"
## ARTHDX
## "Arthritis Diagnosis"
## [1] "Contingency table"
## ARTHDX
## SDENICPROD 1 YES 2 NO
## 1 YES 0.251 0.749
## 2 NO 0.331 0.669
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 61.669, df = 1, p-value = 4.064e-15
##
## [1] "------------"
## CALUNG
## "Lung Cancer Diagnosis"
## [1] "Contingency table"
## CALUNG
## SDENICPROD 1 YES 2 NO
## 1 YES 0.071 0.929
## 2 NO 0.026 0.974
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 12.572, df = 1, p-value = 0.0003916
##
## [1] "------------"
## CHOLDX
## "Cholesterol Diagnosis"
## [1] "Contingency table"
## CHOLDX
## SDENICPROD 1 YES 2 NO
## 1 YES 0.281 0.719
## 2 NO 0.389 0.611
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 102.68, df = 1, p-value < 2.2e-16
##
## [1] "------------"
## CHBRON31
## "Chronic Bronchitis Diagnosis"
## [1] "Contingency table"
## CHBRON31
## SDENICPROD 1 YES 2 NO
## 1 YES 0.024 0.976
## 2 NO 0.014 0.986
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 13.059, df = 1, p-value = 0.0003019
##
## [1] "------------"
## DIABDX_M18
## "Diabetes Diagnosis"
## [1] "Contingency table"
## DIABDX_M18
## SDENICPROD 1 YES 2 NO
## 1 YES 0.100 0.900
## 2 NO 0.155 0.845
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 50.32, df = 1, p-value = 1.306e-12
##
## [1] "------------"
## EMPHDX
## "Emphysema Diagnosis"
## [1] "Contingency table"
## EMPHDX
## SDENICPROD 1 YES 2 NO
## 1 YES 0.041 0.959
## 2 NO 0.020 0.980
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 40.451, df = 1, p-value = 2.016e-10
##
## [1] "------------"
## HIBPDX
## "High Blood Pressure Diagnosis"
## [1] "Contingency table"
## HIBPDX
## SDENICPROD 1 YES 2 NO
## 1 YES 0.310 0.690
## 2 NO 0.413 0.587
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 91.721, df = 1, p-value < 2.2e-16
##
## [1] "------------"
## STRKDX
## "Stroke Diagnosis"
## [1] "Contingency table"
## STRKDX
## SDENICPROD 1 YES 2 NO
## 1 YES 0.039 0.961
## 2 NO 0.054 0.946
## [1] "testing hypothesis of no association"
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 9.7534, df = 1, p-value = 0.00179
##
## [1] "------------"
Association between usage of Electronic Nicotine product and diseases: analysis with conditional contingency tables and chi squared tests Summary of key points:
-Arthritis: strong association but in the opposite direction: who uses e-cigs has a lower rate of Arthritis. Probably because of confounding effect of age. Young people are less likely to have Arthritis, but are much more likely to use e cigs
-Lung Cancer: strong association, correct direction
-Cholesterol: strong association, but in opposite direction: need to control for age like arthritis
-Chronic Bronchitis: strong association, correct direction
-Diabetes: strong association, correct direction
-Emphysema: strong association, correct direction
-High Blood Pressure: strong association, but in opposite direction: need to control for age like arthritis
-Stroke: strong association, correct direction
# I don't like this plot anymore, showing the p-value makes almost no sense
# it made sense for ADSMOK42 because smokers always had higher rates than non smokers
# however for SDENICPROD for some diseases users have lower rates than non users
# so I will try to plot the odds ratio instead, which will let us see in which direction
# there is association -Giuseppe
chisq_df_diseases = data.frame(diseases = health_disease_variables,
p_val_full_data= rep(NA, length(health_disease_variables)),
p_val_65plus_data= rep(NA, length(health_disease_variables)),
p_val_45_64_data= rep(NA, length(health_disease_variables)))
for(i in 1:nrow(chisq_df_diseases)){
variable = chisq_df_diseases[i,"diseases"]
test_full = cont_tables(small_df,"SDENICPROD",variable, F) %>%
chisq.test(correct = F)
test_45_64 = cont_tables_age(small_df,"SDENICPROD",variable,"45-64",F) %>%
chisq.test(correct = F)
test_65plus = cont_tables_age(small_df,"SDENICPROD",variable,"65+",F) %>%
chisq.test(correct = F)
chisq_df_diseases[i,"p_val_full_data"] = test_full$p.val
chisq_df_diseases[i,"p_val_45_64_data"] = test_45_64$p.val
chisq_df_diseases[i,"p_val_65plus_data"] = test_65plus$p.val
}## Warning in chisq.test(., correct = F): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(., correct = F): Chi-squared approximation may be
## incorrect
chisq_df_diseases = chisq_df_diseases %>%
rename(full_data = p_val_full_data,
age_65plus = p_val_65plus_data,
age_45_64 = p_val_45_64_data) %>%
mutate(diseases = titles_health_disease[diseases]) %>%
pivot_longer(cols=c("full_data","age_65plus","age_45_64"))
ggplot(chisq_df_diseases, aes(x = name, y = value)) +
geom_col() + geom_hline(aes(yintercept = 0.05), col = "red") +
facet_wrap(~diseases, scales = "free") +
labs(title="Testing hypothesis of no association between
Electronic Nicotine Product usage and various diseases",
subtitle = "P-value of Chi squared test, controlling for age groups",
x = "", y ="")+
scale_x_discrete(labels = c("full_data" = "All patients",
"age_65plus" = "ages 65 plus" ,
"age_45_64" = "ages 45-64")) +
theme_minimal() + theme(axis.text.x = element_text(size = 7))#odds ratio
#odds of diseases for users/ odds of diseases for non users
or_data_all = expand.grid(diseases = health_disease_variables,
age = "all",estimate = NA,
lower = NA, upper = NA)
for(i in 1:nrow(or_data_all)){
var_of_interest = as.character(or_data_all$diseases[i])
or = small_df %>%
select(SDENICPROD,var_of_interest)%>%
filter(SDENICPROD %in% c("1 YES", "2 NO"),
.data[[var_of_interest]] %in% c("1 YES", "2 NO")) %>%
mutate(across(c(1,2), factor)) %>%
table() %>% epitools::oddsratio(method = "wald", correction = F)
or_data_all[i,"estimate"] = or$measure[2,"estimate"]
or_data_all[i,"lower"] = or$measure[2,"lower"]
or_data_all[i,"upper"] = or$measure[2,"upper"]
}
or_data_45_64 = expand.grid(diseases = health_disease_variables,
age = "45_64",estimate = NA,
lower = NA, upper = NA)
for(i in 1:nrow(or_data_45_64)){
var_of_interest = as.character(or_data_45_64$diseases[i])
or = small_df %>% filter(age_group == "45-64") %>%
select(SDENICPROD,var_of_interest)%>%
filter(SDENICPROD %in% c("1 YES", "2 NO"),
.data[[var_of_interest]] %in% c("1 YES", "2 NO")) %>%
mutate(across(c(1,2), factor)) %>%
table() %>% epitools::oddsratio(method = "wald", correction = F)
or_data_45_64[i,"estimate"] = or$measure[2,"estimate"]
or_data_45_64[i,"lower"] = or$measure[2,"lower"]
or_data_45_64[i,"upper"] = or$measure[2,"upper"]
}## Warning in chisq.test(xx, correct = correction): Chi-squared approximation may
## be incorrect
or_data_65plus = expand.grid(diseases = health_disease_variables,
age = "65plus",estimate = NA,
lower = NA, upper = NA)
for(i in 1:nrow(or_data_65plus)){
var_of_interest = as.character(or_data_65plus$diseases[i])
or = small_df %>% filter(age_group == "65+") %>%
select(SDENICPROD,var_of_interest)%>%
filter(SDENICPROD %in% c("1 YES", "2 NO"),
.data[[var_of_interest]] %in% c("1 YES", "2 NO")) %>%
mutate(across(c(1,2), factor)) %>%
table() %>% epitools::oddsratio(method = "wald", correction = F)
or_data_65plus[i,"estimate"] = or$measure[2,"estimate"]
or_data_65plus[i,"lower"] = or$measure[2,"lower"]
or_data_65plus[i,"upper"] = or$measure[2,"upper"]
}## Warning in chisq.test(xx, correct = correction): Chi-squared approximation may
## be incorrect
or_data = rbind(or_data_all,or_data_45_64,or_data_65plus)
or_data %>% mutate(diseases = titles_health_disease[diseases]) %>%
ggplot(aes(x = age)) + geom_point(aes(y = estimate)) +
geom_segment(aes(x = age, xend = age, y = lower, yend = upper))+
geom_hline(aes(yintercept = 1), col = "red")+
facet_wrap(~diseases, scales = "free") +
labs(title="Odds Ratio of Disease Diagnosis for Users and Non-Users of E-Nic. Prods",
subtitle = "95% confidence intervals for different age groups",
x = "", y ="")+
scale_x_discrete(labels = c("all" = "All",
"65plus" = "65 and over" ,
"45_64" = "45-64")) +
theme_minimal() + theme(axis.text.x = element_text(size = 7))#ADGENH42
table_multi_cat(small_df, "SDENICPROD", "ADGENH42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADGENH42)) +
labs(title = "Contingency table: Electronic nicotine product usage and General health", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "Users",
"2 NO" = "Non users"))#RTHLTH31
table_multi_cat(small_df, "SDENICPROD", "RTHLTH31") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = RTHLTH31)) +
labs(title = "Contingency table: Electronic nicotine product usage and Perceived Health Status", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "Users",
"2 NO" = "Non Users"))#ADENGY42
table_multi_cat(small_df, "SDENICPROD", "ADENGY42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADENGY42)) +
labs(subtitle = "During the past 4 weeks, how many people had a lot of energy",
title = "Contingency table: Electronic nicotine product usage and energy", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "users",
"2 NO" = "Non users"))#WLKLIM53
table_multi_cat(small_df, "SDENICPROD", "WLKLIM53") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = WLKLIM53)) +
labs(title = "Electronic nicotine product usage and Limitation in Physical Functioning",
subtitle = "Contingency table with conditional probability",
x = "",
y="frequency", fill="Walking Limitation") +
scale_x_discrete(labels = c("1 YES" = "users",
"2 NO" = "Non users"))+
scale_fill_discrete(labels = c("Yes", "No"))# odds ratio: E-nic usage and walking limitations in different age groups
table_multi_cat_control(small_df, "SDENICPROD", "WLKLIM53",F) %>% as_tibble() %>%
filter(age_group != "0-18") %>%
group_by(age_group) %>%
summarise(or_estimate = epitools::oddsratio(n,method = "wald",
correction = F)$measure[2,"estimate"],
or_lower = epitools::oddsratio(n,method = "wald",
correction = F)$measure[2,"lower"],
or_upper = epitools::oddsratio(n,method = "wald",
correction = F)$measure[2,"upper"]) %>%
mutate(age_group = factor(age_group)) %>%
ggplot() +
geom_point(aes(x = age_group, y = or_estimate)) +
geom_segment(aes(x = age_group,xend=age_group,
y = or_lower, yend=or_upper))+
geom_hline(aes(yintercept =1), col = "red") +
labs(title = "E-nicotine usage and Limitation in Physical Functioning across different ages",
subtitle = "95% confidence intervals for odds ratio",
x = "Age Group",
y="odds ratio")## Warning: There were 3 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `or_estimate = ...[]`.
## ℹ In group 1: `age_group = "18-24"`.
## Caused by warning in `chisq.test()`:
## ! Chi-squared approximation may be incorrect
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.
# "MNHLTH31"
table_multi_cat(small_df, "SDENICPROD", "MNHLTH31") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = MNHLTH31)) +
labs(title = "electronic nicotine product usage and Perceived Mental Health status", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "users",
"2 NO" = "Non users"))#"ADPCFL42"
table_multi_cat(small_df, "SDENICPROD", "ADPCFL42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADPCFL42)) +
labs(subtitle = "During the past 4 weeks, how often people felt calm and peaceful",
title = "electronic nicotine product usage and feeling peaceful", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "users",
"2 NO" = "Non users"))#"ADDPRS42"
table_multi_cat(small_df, "SDENICPROD", "ADDPRS42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADDPRS42)) +
labs(subtitle = "How many days in the past 2 weeks people felt down/depressed/hopeless",
title = "Contingency table: electronic nicotine product usage and feeling down", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "users",
"2 NO" = "Non users"))# "ADHOPE42"
table_multi_cat(small_df, "SDENICPROD", "ADHOPE42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADHOPE42)) +
labs(subtitle = "During the past 30 days, how often people felt hopeless",
title = "electronic nicotine product usage and feeling hopeless", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "users",
"2 NO" = "Non users"))# "ADNERV42"
table_multi_cat(small_df, "SDENICPROD", "ADNERV42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADNERV42)) +
labs(subtitle = "During the past 30 days, how often people felt nervous",
title = "electronic nicotine product usage and feeling nervous", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "users",
"2 NO" = "Non users"))# "ADSAD42"
table_multi_cat(small_df, "SDENICPROD", "ADSAD42") %>% as_tibble() %>%
ggplot() + geom_col(aes(x = SDENICPROD, y = n, fill = ADSAD42)) +
labs(subtitle = "During the past 30 days, how often people felt so sad that nothing could cheer the person up",
title = "electronic nicotine product usage and feeling sad", x = "",
y="frequency", fill="") +
scale_x_discrete(labels = c("1 YES" = "users",
"2 NO" = "Non users"))# mental health controlling for age
# looking at how the contingency table for perceived mental health and e-nic. usage
# is different in the different age groups
age_18_24 = table_multi_cat_age(small_df, "SDENICPROD", "MNHLTH31", "18-24") %>% as_tibble %>%
mutate(age = "18-24")
age_25_34 = table_multi_cat_age(small_df, "SDENICPROD", "MNHLTH31", "25-34") %>% as_tibble %>%
mutate(age = "25-34")
age_35_44 = table_multi_cat_age(small_df, "SDENICPROD", "MNHLTH31", "35-44") %>% as_tibble %>%
mutate(age = "35-44")
age_45_64 = table_multi_cat_age(small_df, "SDENICPROD", "MNHLTH31", "45-64") %>% as_tibble %>%
mutate(age = "45-64")
age_65plus = table_multi_cat_age(small_df, "SDENICPROD", "MNHLTH31", "65+") %>% as_tibble %>%
mutate(age = "65+")
all_tables = rbind(age_18_24,age_25_34,age_35_44,age_45_64,age_65plus)
all_tables %>% mutate(age = factor(age)) %>%
ggplot(aes(x = n, fill = MNHLTH31, y = SDENICPROD)) +
geom_col(orientation = "y") +
facet_grid(rows = vars(age))+
labs(title = "Perceived Mental Health and E-Nic. in different age groups", y = "",
subtitle = "Contingency tables with conditional probability", x="frequency", fill="Perceived Mental Health") +
scale_y_discrete(labels = c("1 YES" = "Users",
"2 NO" = "Non-Users"))+
theme(strip.text.y = element_text(angle = 0))Summary:
the general health of e-nicotine products users is the not much lower than that of non users. However there is a bigger gap in the mental health of the two groups: in the data we see that non users tend to have higher values of good mental health
Now we started to fit some models for Smoking and E-Smoking. We decided to analyse the two variables in distinct ways because… We decide to start with the smoking predictor. We want to keep only AGE, SEX and AGEGROUP as other predictors.
values_to_remove <- c("-15 CANNOT BE COMPUTED", "-8 DK", "-7 REFUSED", "-1 INAPPLICABLE")
small_df = small_df %>% mutate(AGEGROUP = age_group)filter_df <- small_df %>% select(ARTHDX, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$ARTHDX <- relevel(filter_df$ARTHDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(ARTHDX ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = ARTHDX ~ ADSMOK42 * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.275674 0.094329 -55.928 < 2e-16 ***
## ADSMOK421 YES 0.182979 0.076769 2.384 0.01715 *
## SEX2 FEMALE 0.580436 0.044777 12.963 < 2e-16 ***
## AGELAST 0.072394 0.001385 52.257 < 2e-16 ***
## ADSMOK421 YES:SEX2 FEMALE 0.340559 0.101467 3.356 0.00079 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19607 on 15389 degrees of freedom
## Residual deviance: 15400 on 15385 degrees of freedom
## AIC: 15410
##
## Number of Fisher Scoring iterations: 5
data_point <- filter_df %>% group_by(AGEGROUP,AGELAST, SEX, ADSMOK42) %>% summarise(WithCancer = sum(ARTHDX == "1 YES"), tot = n()) %>%
mutate(frequency = WithCancer/tot)## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
# Creating the prediction data frame
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 95))
# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")
# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of ARTHDX by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()All our variables are significant. Arthritis is characterized by a positive correlation with smoking (more pronounced in women), sex and age.
filter_df <- small_df %>% select(DIABDX_M18, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$DIABDX_M18 <- relevel(filter_df$DIABDX_M18, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(DIABDX_M18 ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = DIABDX_M18 ~ ADSMOK42 * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.201238 0.101001 -41.596 < 2e-16 ***
## ADSMOK421 YES 0.029254 0.087979 0.333 0.739500
## SEX2 FEMALE -0.152259 0.051660 -2.947 0.003205 **
## AGELAST 0.043263 0.001486 29.117 < 2e-16 ***
## ADSMOK421 YES:SEX2 FEMALE 0.408440 0.117066 3.489 0.000485 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13262 on 15417 degrees of freedom
## Residual deviance: 12230 on 15413 degrees of freedom
## AIC: 12240
##
## Number of Fisher Scoring iterations: 5
data_point <- filter_df %>% group_by(AGEGROUP,AGELAST, SEX, ADSMOK42) %>% summarise(WithCancer = sum(DIABDX_M18 == "1 YES"), tot = n()) %>%
mutate(frequency = WithCancer/tot)## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
# Creating the prediction data frame
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 95))
# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")
# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of DIABDX_M18 by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()+
ylim(c(0, 0.6))## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
In this case, looking at smoking in general is not significant, but if we control for sex, in the female there is definitely a correlation. Diabetes is characterized by a positive correlation with smoking (significant in women) and with age, and a negative one with being female.
filter_df <- small_df %>% select(EMPHDX, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$EMPHDX <- relevel(filter_df$EMPHDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(EMPHDX ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = EMPHDX ~ ADSMOK42 * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.777000 0.307674 -25.277 < 2e-16 ***
## ADSMOK421 YES 1.104037 0.169023 6.532 6.5e-11 ***
## SEX2 FEMALE -0.366242 0.141096 -2.596 0.00944 **
## AGELAST 0.061537 0.004193 14.675 < 2e-16 ***
## ADSMOK421 YES:SEX2 FEMALE 0.644011 0.224886 2.864 0.00419 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3400.4 on 15388 degrees of freedom
## Residual deviance: 2994.3 on 15384 degrees of freedom
## AIC: 3004.3
##
## Number of Fisher Scoring iterations: 7
data_point <- filter_df %>% group_by(AGEGROUP,AGELAST, SEX, ADSMOK42) %>% summarise(WithCancer = sum(EMPHDX == "1 YES"), tot = n()) %>%
mutate(frequency = WithCancer/tot)## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
# Creating the prediction data frame
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 95))
# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")
# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of EMPHDX by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()+
ylim(c(0, 0.35))## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
All our variables are significant. Emphysema is characterized by a positive correlation with smoking and age (more pronounced in smokers), and a negative one for being female.
filter_df <- small_df %>% select(STRKDX, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$STRKDX <- relevel(filter_df$STRKDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(STRKDX ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = STRKDX ~ ADSMOK42 * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.584584 0.195221 -33.729 < 2e-16 ***
## ADSMOK421 YES 0.540824 0.125410 4.312 1.61e-05 ***
## SEX2 FEMALE -0.196109 0.083610 -2.346 0.019 *
## AGELAST 0.060109 0.002713 22.155 < 2e-16 ***
## ADSMOK421 YES:SEX2 FEMALE 0.240915 0.168872 1.427 0.154
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6558.6 on 15388 degrees of freedom
## Residual deviance: 5871.2 on 15384 degrees of freedom
## AIC: 5881.2
##
## Number of Fisher Scoring iterations: 6
data_point <- filter_df %>% group_by(AGEGROUP,AGELAST, SEX, ADSMOK42) %>% summarise(WithCancer = sum(STRKDX == "1 YES"), tot = n()) %>%
mutate(frequency = WithCancer/tot)## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
# Creating the prediction data frame
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 95))
# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")
# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of STRKDX by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()+
ylim(c(0, 0.4))## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
All our variables are significant, except for the interaction between smoking and sex. Stroke is characterized by a positive correlation with smoking and age, and a negative one for being female.
filter_df <- small_df %>% select(CALUNG, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$CALUNG <- relevel(filter_df$CALUNG, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(CALUNG ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = CALUNG ~ ADSMOK42 * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.06890 0.96416 -6.295 3.08e-10 ***
## ADSMOK421 YES 1.27743 0.44114 2.896 0.00378 **
## SEX2 FEMALE 0.24805 0.30244 0.820 0.41212
## AGELAST 0.03153 0.01263 2.497 0.01252 *
## ADSMOK421 YES:SEX2 FEMALE -0.96393 0.63175 -1.526 0.12706
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 566.73 on 2233 degrees of freedom
## Residual deviance: 553.81 on 2229 degrees of freedom
## AIC: 563.81
##
## Number of Fisher Scoring iterations: 6
data_point <- filter_df %>% group_by(AGEGROUP,AGELAST, SEX, ADSMOK42) %>% summarise(WithCancer = sum(CALUNG == "1 YES"), tot = n()) %>%
mutate(frequency = WithCancer/tot)## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
# Creating the prediction data frame
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 95))
# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")
# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of CALUNG by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()+
ylim(c(0, 0.35))## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
In this case, both age and interaction between smoking and sex seem to be not significant. However we can observe that Lung Cancer is characterized by a positive correlation with both smoking and age, and this correlations are more pronounced for men.
filter_df <- small_df %>% select(ASTHDX, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == c("18-24"))
filter_df$ASTHDX <- relevel(filter_df$ASTHDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(ASTHDX ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = ASTHDX ~ ADSMOK42 * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.671965 0.076252 -21.927 < 2e-16 ***
## ADSMOK421 YES 0.041146 0.089118 0.462 0.64430
## SEX2 FEMALE 0.285653 0.051757 5.519 3.41e-08 ***
## AGELAST -0.004836 0.001230 -3.933 8.38e-05 ***
## ADSMOK421 YES:SEX2 FEMALE 0.313916 0.114266 2.747 0.00601 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13120 on 15420 degrees of freedom
## Residual deviance: 13025 on 15416 degrees of freedom
## AIC: 13035
##
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>% group_by(AGEGROUP,AGELAST, SEX, ADSMOK42) %>% summarise(WithAsthma = sum(ASTHDX == "1 YES"), tot = n()) %>%
mutate(frequency = WithAsthma/tot) ## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 95))
# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")
# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of ASTHDX by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5))Also in this case, looking at smoking in general is not significant, but if we look at the interaction between smoking and sex, there is definitely a correlation. Asthma is quite particular, because is the only one that is characterized by a negative correlation with age. Also in this case we have a positive correlation with smoking (significant in female) and with sex.
filter_df <- small_df %>% select(CHOLDX, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$CHOLDX <- relevel(filter_df$CHOLDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(CHOLDX ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = CHOLDX ~ ADSMOK42 * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.744208 0.075692 -49.466 < 2e-16 ***
## ADSMOK421 YES 0.060372 0.068518 0.881 0.378254
## SEX2 FEMALE -0.174565 0.041336 -4.223 2.41e-05 ***
## AGELAST 0.059723 0.001189 50.227 < 2e-16 ***
## ADSMOK421 YES:SEX2 FEMALE 0.317587 0.093992 3.379 0.000728 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20600 on 15380 degrees of freedom
## Residual deviance: 17303 on 15376 degrees of freedom
## AIC: 17313
##
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>% group_by(AGELAST, SEX, ADSMOK42) %>% summarise(WithChol = sum(CHOLDX == "1 YES"), tot = n()) %>%
mutate(frequency = WithChol/tot) ## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
prediction$Yval <- predict(fit, newdata = prediction, type = "response")
ggplot(prediction, aes(x = AGELAST, y = Yval, col = ADSMOK42)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of CHOLDX by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()
Also in this case, looking at smoking in general is not significant, as
we see in the interaction with sex, in the female there is definitely a
correlation. High Cholesterol is characterized by a positive correlation
with smoking (significant in women) and with age, and a negative one
with being female.
filter_df <- small_df %>% select(CHBRON31, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$CHBRON31 <- relevel(filter_df$CHBRON31, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(CHBRON31 ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = CHBRON31 ~ ADSMOK42 * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.540038 0.261348 -21.198 < 2e-16 ***
## ADSMOK421 YES 0.383047 0.261141 1.467 0.1424
## SEX2 FEMALE 0.399067 0.161999 2.463 0.0138 *
## AGELAST 0.017757 0.003786 4.690 2.73e-06 ***
## ADSMOK421 YES:SEX2 FEMALE 0.576253 0.308866 1.866 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2537.5 on 15328 degrees of freedom
## Residual deviance: 2467.1 on 15324 degrees of freedom
## AIC: 2477.1
##
## Number of Fisher Scoring iterations: 7
data_point <- filter_df %>% group_by(AGELAST, SEX, ADSMOK42) %>% summarise(WithBronch = sum(CHBRON31 == "1 YES"), tot = n()) %>%
mutate(frequency = WithBronch/tot) ## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")
# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = ADSMOK42)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of CHBRON31 by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5))In this case, both age and interaction between smoking and sex seem to be not so significant (not at all in general and really low for the interaction). However we can observe that Chronic Bronchitis is characterized by a positive correlation with both age and smoking (more pronunced a little bit more pronunced for female).
filter_df <- small_df %>% select(HIBPDX, ADSMOK42, SEX, AGELAST) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$HIBPDX <- relevel(filter_df$HIBPDX, ref = "2 NO")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(HIBPDX ~ ADSMOK42 * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = HIBPDX ~ ADSMOK42 * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.972786 0.077436 -51.304 < 2e-16 ***
## ADSMOK421 YES 0.250575 0.068732 3.646 0.000267 ***
## SEX2 FEMALE -0.230722 0.041966 -5.498 3.84e-08 ***
## AGELAST 0.065871 0.001229 53.612 < 2e-16 ***
## ADSMOK421 YES:SEX2 FEMALE 0.189444 0.094694 2.001 0.045437 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20879 on 15382 degrees of freedom
## Residual deviance: 16958 on 15378 degrees of freedom
## AIC: 16968
##
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>% group_by(AGELAST, SEX, ADSMOK42) %>% summarise(WithPress = sum(HIBPDX == "1 YES"), tot = n()) %>%
mutate(frequency = WithPress/tot) ## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")
# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = ADSMOK42)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of HIBPDX by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5))In this case, all our variables are significant. High Blood Pressure is characterized by a positive correlation with smoking and with age, and a negative one with being female.
As we can say we obtained a lot of different results, both negative and positive, but we can see that for some models there is a direct correlation with AGE. Also some plots are quite similar (like the one for Cholesterol and Blood pressure) So there could be some connection between those disease. To prove this we can try if some physical diseases have some of connection.
Now to evaluate the mental health status we need to change the type of the regression. In the first part we used the Logistic Regression, but for this part we need to use the Multinomial Logistic Regression. MLR is a classification method that modify logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes (I promise, is just a case that is very similar to the Wikipedia description).
filter_df <-small_df %>% select(MNHLTH31, ADSMOK42, SEX, AGELAST) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$MNHLTH31 <- relevel(filter_df$MNHLTH31, ref = "1 EXCELLENT")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
filter_df$MNHLTH31 = ordered(filter_df$MNHLTH31)fit <- vglm(MNHLTH31 ~ AGELAST,
family = cumulative(parallel = TRUE),
data = filter_df)
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
matplot(df_pred$AGELAST, predict(fit, df_pred,type = "response"), type = "l",
ylab = "Predicted Probability", xlab = "Age", lwd = 2) +
title("Mental Health by Age") +
grid()## integer(0)
Here we can see the general distribution of the Perceived Mental Health Status for every age. As we see, the majority of the people are distributed between the 3 best cases, with the “very good” one being the most frequent.
null <- vglm(MNHLTH31 ~ 1,
family = cumulative(parallel = TRUE),
data = filter_df)
pchisq(deviance(null)-deviance(fit), df <- df.residual(null)-df.residual(fit), lower.tail=FALSE)## [1] 0.194099
Then we fitted the model with also SEX
fit <- vglm(MNHLTH31 ~ ADSMOK42 + SEX + AGELAST,
family = cumulative(parallel = TRUE),
data = filter_df)
summary(fit)##
## Call:
## vglm(formula = MNHLTH31 ~ ADSMOK42 + SEX + AGELAST, family = cumulative(parallel = TRUE),
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.6862955 0.0494676 -13.874 <2e-16 ***
## (Intercept):2 0.7047961 0.0494543 14.251 <2e-16 ***
## (Intercept):3 2.4488090 0.0539677 45.375 <2e-16 ***
## (Intercept):4 4.2516580 0.0747581 56.872 <2e-16 ***
## ADSMOK421 YES -0.6026349 0.0379176 -15.893 <2e-16 ***
## SEX2 FEMALE -0.2528145 0.0293955 -8.600 <2e-16 ***
## AGELAST -0.0013227 0.0007976 -1.658 0.0972 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]),
## logitlink(P[Y<=3]), logitlink(P[Y<=4])
##
## Residual deviance: 41991.95 on 61629 degrees of freedom
##
## Log-likelihood: -20995.97 on 61629 degrees of freedom
##
## Number of Fisher scoring iterations: 4
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):4'
##
##
## Exponentiated coefficients:
## ADSMOK421 YES SEX2 FEMALE AGELAST
## 0.5473675 0.7766120 0.9986782
Oh no! There is a problem in our regression, what is that? The Hauck-Donner effect? It reminds me a name of a music instrument brand…never mind.
The Hauck-Donner effect is a problem that can occur if we use small data-frame, indicates that the significance of our model can be false. But how can we solve it? Don’t worry, is very simple.
We bootstrapped the original data-frame for 100 times (better more but our pc are not so powerful).
data <- filter_df
formula <- MNHLTH31 ~ ADSMOK42 + SEX + AGELAST
fit_model <- function(data, indices) {
boot_data <- data[indices, ]
model <- vglm(formula,
family = cumulative(parallel = TRUE),
data = boot_data)
return(coef(model))
}
set.seed(123)
bootstrap_results <- boot(data = data, statistic = fit_model, R = 100)
bootstrap_results##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = fit_model, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -0.686295534 8.293538e-04 0.0481159045
## t2* 0.704796145 2.315522e-03 0.0512439628
## t3* 2.448809015 4.123762e-03 0.0553600060
## t4* 4.251657965 4.672130e-03 0.0857172943
## t5* -0.602634909 -2.362184e-03 0.0409388089
## t6* -0.252814473 -5.483153e-04 0.0285605837
## t7* -0.001322697 -1.281613e-06 0.0008524966
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstrap_results, type = "basic")
##
## Intervals :
## Level Basic
## 95% (-0.7822, -0.5829 )
## Calculations and Intervals on Original Scale
## Some basic intervals may be unstable
Fortunately the bootstrap confirm the robustness of our model.
Then we compared the smokers and non smokers filtering the dataset for this variable and fitting 2 models.
smk_df = filter_df %>% filter(ADSMOK42 == "1 YES")
n_smk_df = filter_df %>% filter(ADSMOK42 == "2 NO")# For smokers
fit <- vglm(MNHLTH31 ~ AGELAST,
family = cumulative(parallel = TRUE),
data = smk_df)
df_pred = data.frame(AGELAST = seq(min(smk_df$AGELAST), max(smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred)
cum_probabilities = t(apply(preds,1,plogis))
par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2,
ylab = "Cumulative probabilites", xlab = "AGELAST")
grid()
legend("bottomleft",
legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.08))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
ylab = "Category probabilites", xlab = "AGELAST")
grid()
legend("topright",
legend = sprintf("Pr[Y = %g]", 1:5),
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.69))
mtext("Perceived Mental Health Status for Smokers", outer = TRUE, cex = 1.5, line = -3.5)# For non smokers
fit <- vglm(MNHLTH31 ~ AGELAST,
family = cumulative(parallel = TRUE),
data = n_smk_df)
df_pred = data.frame(AGELAST = seq(min(n_smk_df$AGELAST), max(n_smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred) # check that they sum to 1
cum_probabilities = t(apply(preds,1,plogis))
par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2,
ylab = "Cumulative probabilites", xlab = "AGELAST")
grid()
legend("bottomleft",
legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.14))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
ylab = "Category probabilites", xlab = "AGELAST")
grid()
legend("topright",
legend = sprintf("Pr[Y = %g]", 1:5),
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.37))
mtext("Perceived Mental Health Status for Non-Smokers", outer = TRUE, cex = 1.5, line = -3)From this 2 groups of plots we see that non-smokers tend to have better Perceived Mental Health. The majority of the non-smoking population perceives their mental health as “very good,” with the best category (“excellent,” shown in black) being the second most prevalent up to age 56. In contrast, among smokers, the most common perception is “good,” followed by “very good.” “Excellent” ranks only third.
Here, we have performed the same analysis for the “General Health” variable.
filter_df <-small_df %>% select(ADGENH42, ADSMOK42, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$ADGENH42 <- relevel(filter_df$ADGENH42, ref = "1 EXCELLENT")
filter_df$ADSMOK42 <- relevel(filter_df$ADSMOK42, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- vglm(ADGENH42 ~ AGELAST , data = filter_df, family = multinomial)
# summary(fit)
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), ADSMOK42 = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
matplot(df_pred$AGELAST, predict(fit, df_pred,type = "response"), type = "l",
ylab = "Predicted Probability", xlab = "Age", lwd = 2)
title("General Health by Age")
grid()
legend("bottomleft",legend = sort(unique(filter_df$ADGENH42)), col = 1:5, lty = 1:5, lwd = 2, cex = 0.6, inset = c(0, 0.25))The General Health plot is quite different from the previous one, as we see that the effect of age is stronger. In this case we see a strong decreasing in the best cases (Very good and Excellent) and the increase of the others.
## 2.5 % 97.5 %
## (Intercept):1 4.26357682 5.1100094402
## (Intercept):2 4.07383515 4.9055471321
## (Intercept):3 3.32900825 4.1641201093
## (Intercept):4 1.64751697 2.5242616426
## AGELAST:1 -0.05768480 -0.0441035318
## AGELAST:2 -0.03782884 -0.0247116562
## AGELAST:3 -0.02556620 -0.0124293833
## AGELAST:4 -0.01309841 0.0006783151
null = vglm(ADGENH42 ~ 1, family = multinomial, data = filter_df)
print(pseudo_R2 <- 1 - deviance(fit) / deviance(null))## [1] 0.02186083
## Analysis of Deviance Table (Type II tests)
##
## Model: 'multinomial', 'VGAMcategorical'
##
## Link: 'multilogitlink'
##
## Response: ADGENH42
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## AGELAST 4 904.6 59756 41380 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Than we added the sex variable also in this case and we did some plots
filter_df$ADGENH42 = ordered(filter_df$ADGENH42)
fit <- vglm(ADGENH42 ~ ADSMOK42 + SEX + AGELAST,
family = cumulative(parallel = TRUE),
data = filter_df)
summary(fit)##
## Call:
## vglm(formula = ADGENH42 ~ ADSMOK42 + SEX + AGELAST, family = cumulative(parallel = TRUE),
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.1321573 0.0504790 -2.618 0.00884 **
## (Intercept):2 1.6604184 0.0519717 31.949 < 2e-16 ***
## (Intercept):3 3.3748393 0.0573363 58.860 < 2e-16 ***
## (Intercept):4 5.4332679 0.0762192 71.285 < 2e-16 ***
## ADSMOK421 YES -0.7088684 0.0389372 -18.205 < 2e-16 ***
## SEX2 FEMALE -0.1800449 0.0300423 -5.993 2.06e-09 ***
## AGELAST -0.0252239 0.0008396 -30.043 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]),
## logitlink(P[Y<=3]), logitlink(P[Y<=4])
##
## Residual deviance: 40151.52 on 59753 degrees of freedom
##
## Log-likelihood: -20075.76 on 59753 degrees of freedom
##
## Number of Fisher scoring iterations: 4
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):4'
##
##
## Exponentiated coefficients:
## ADSMOK421 YES SEX2 FEMALE AGELAST
## 0.4922008 0.8352327 0.9750915
data <- filter_df
formula <- ADGENH42 ~ ADSMOK42 + SEX + AGELAST
fit_model <- function(data, indices) {
boot_data <- data[indices, ]
model <- vglm(formula,
family = cumulative(parallel = TRUE),
data = boot_data)
return(coef(model))
}
set.seed(123)
bootstrap_results <- boot(data = data, statistic = fit_model, R = 100)
bootstrap_results##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = fit_model, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -0.13215726 -4.730256e-03 0.0561873669
## t2* 1.66041835 -2.431014e-04 0.0568043897
## t3* 3.37483932 1.344731e-03 0.0617590369
## t4* 5.43326790 -1.390854e-03 0.0801421083
## t5* -0.70886842 5.612468e-03 0.0420773001
## t6* -0.18004489 2.424876e-04 0.0292357120
## t7* -0.02522394 -3.782614e-05 0.0009120621
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstrap_results, type = "basic")
##
## Intervals :
## Level Basic
## 95% (-0.2509, -0.0156 )
## Calculations and Intervals on Original Scale
## Some basic intervals may be unstable
Also in this case we checked if our model was robust before splitting in smokers and non-smokers, and it was.
smk_df = filter_df %>% filter(ADSMOK42 == "1 YES")
n_smk_df = filter_df %>% filter(ADSMOK42 == "2 NO")# For smokers
fit <- vglm(ADGENH42 ~ AGELAST,
family = cumulative(parallel = TRUE),
data = smk_df)
df_pred = data.frame(AGELAST = seq(min(smk_df$AGELAST), max(smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred)
cum_probabilities = t(apply(preds,1,plogis))
par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2,
ylab = "Cumulative probabilites", xlab = "AGELAST")
grid()
legend("bottomleft",
legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.4))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
ylab = "Category probabilites", xlab = "AGELAST")
grid()
legend("topright",
legend = sprintf("Pr[Y = %g]", 1:5),
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.08))
mtext("General Health Status for Smokers", outer = TRUE, cex = 1.5, line = -3.5)# For not smokers
fit <- vglm(ADGENH42 ~ AGELAST,
family = cumulative(parallel = TRUE),
data = n_smk_df)
df_pred = data.frame(AGELAST = seq(min(smk_df$AGELAST), max(smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred)
cum_probabilities = t(apply(preds,1,plogis))
par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2,
ylab = "Cumulative probabilites", xlab = "AGELAST")
grid()
legend("bottomleft",
legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.4))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
ylab = "Category probabilites", xlab = "AGELAST")
grid()
legend("topright",
legend = sprintf("Pr[Y = %g]", 1:5),
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.08))
mtext("General Health Status for NOT Smokers", outer = TRUE, cex = 1.5, line = -3.5)Also in this case we see that non-smokers tend to have better General Health. Non smokers are starting from higher General Health categories, and are inverting with worst ones later in the years respect to smokers. We can observe that as years pass, the most of the smoker are distributing between the “Good” and “Fair” categories, with also the “Poor” one surpassing the “Excellent” one. For non smokers this inversion never happens, and later in the years, the most of the people are still distributing between the “Very Good” and “Good” categories.
Form now on we will repeat the same steps for the E-smoking
filter_df <- small_df %>% select(STRKDX , SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$STRKDX <- relevel(filter_df$STRKDX , ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(STRKDX ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = STRKDX ~ SDENICPROD * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.500144 0.177679 -36.584 <2e-16 ***
## SDENICPROD1 YES 0.223148 0.169500 1.317 0.1880
## SEX2 FEMALE -0.179755 0.071998 -2.497 0.0125 *
## AGELAST 0.060259 0.002499 24.118 <2e-16 ***
## SDENICPROD1 YES:SEX2 FEMALE 0.349832 0.226808 1.542 0.1230
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7480.7 on 18179 degrees of freedom
## Residual deviance: 6708.6 on 18175 degrees of freedom
## AIC: 6718.6
##
## Number of Fisher Scoring iterations: 6
data_point <- filter_df %>% group_by(AGEGROUP,AGELAST, SEX, SDENICPROD) %>% summarise(WithStroke = sum(STRKDX == "1 YES"), tot = n()) %>%
mutate(frequency = WithStroke/tot)## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
# Creating the prediction data frame
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 95))
# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")
# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = SDENICPROD)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of STRKDX by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()filter_df <- small_df %>% select(ASTHDX, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == c("18-24"))
filter_df$ASTHDX <- relevel(filter_df$ASTHDX, ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(ASTHDX ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = ASTHDX ~ SDENICPROD * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.795164 0.072533 -24.750 < 2e-16 ***
## SDENICPROD1 YES 0.309304 0.086392 3.580 0.000343 ***
## SEX2 FEMALE 0.317237 0.046701 6.793 1.1e-11 ***
## AGELAST -0.003152 0.001169 -2.697 0.006999 **
## SDENICPROD1 YES:SEX2 FEMALE 0.138970 0.112743 1.233 0.217715
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15498 on 18180 degrees of freedom
## Residual deviance: 15375 on 18176 degrees of freedom
## AIC: 15385
##
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>% group_by(AGEGROUP,AGELAST, SEX, SDENICPROD) %>% summarise(WithAsthma = sum(ASTHDX == "1 YES"), tot = n()) %>%
mutate(frequency = WithAsthma/tot) ## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 95))
# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")
# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = SDENICPROD)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of ASTHDX by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()filter_df <- small_df %>% select(ARTHDX , SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == c("18-24"))
filter_df$ARTHDX <- relevel(filter_df$ARTHDX , ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(ARTHDX ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = ARTHDX ~ SDENICPROD * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.37596 0.08785 -61.194 < 2e-16 ***
## SDENICPROD1 YES 0.44257 0.08876 4.986 6.16e-07 ***
## SEX2 FEMALE 0.61273 0.03999 15.323 < 2e-16 ***
## AGELAST 0.07392 0.00130 56.841 < 2e-16 ***
## SDENICPROD1 YES:SEX2 FEMALE 0.16719 0.11586 1.443 0.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22791 on 18178 degrees of freedom
## Residual deviance: 17880 on 18174 degrees of freedom
## AIC: 17890
##
## Number of Fisher Scoring iterations: 5
data_point <- filter_df %>% group_by(AGEGROUP,AGELAST, SEX, SDENICPROD) %>% summarise(WithAsthma = sum(ARTHDX == "1 YES"), tot = n()) %>%
mutate(frequency = WithAsthma/tot) ## `summarise()` has grouped output by 'AGEGROUP', 'AGELAST', 'SEX'. You can
## override using the `.groups` argument.
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 95))
# Predicting values using the model
prediction$Yval <- predict(fit, newdata = prediction, type = "response")
# Plotting the results
ggplot(prediction, aes(x = AGELAST, y = Yval, col = SDENICPROD)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of ARTHDX by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()filter_df <- small_df %>% select(CHOLDX, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$CHOLDX <- relevel(filter_df$CHOLDX, ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(CHOLDX ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = CHOLDX ~ SDENICPROD * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.80149 0.07078 -53.707 < 2e-16 ***
## SDENICPROD1 YES 0.15384 0.07640 2.014 0.0441 *
## SEX2 FEMALE -0.15438 0.03677 -4.199 2.69e-05 ***
## AGELAST 0.06046 0.00112 54.002 < 2e-16 ***
## SDENICPROD1 YES:SEX2 FEMALE 0.12071 0.10611 1.138 0.2553
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24025 on 18166 degrees of freedom
## Residual deviance: 20158 on 18162 degrees of freedom
## AIC: 20168
##
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>% group_by(AGELAST, SEX, SDENICPROD) %>% summarise(WithChol = sum(CHOLDX == "1 YES"), tot = n()) %>%
mutate(frequency = WithChol/tot) ## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
prediction <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
prediction$Yval <- predict(fit, newdata = prediction, type = "response")
ggplot(prediction, aes(x = AGELAST, y = Yval, col = SDENICPROD)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of CHOLDX by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()filter_df <- small_df %>% select(DIABDX_M18 , SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$DIABDX_M18 <- relevel(filter_df$DIABDX_M18 , ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(DIABDX_M18 ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = DIABDX_M18 ~ SDENICPROD * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.252017 0.094413 -45.036 <2e-16 ***
## SDENICPROD1 YES -0.002929 0.105803 -0.028 0.978
## SEX2 FEMALE -0.084187 0.045799 -1.838 0.066 .
## AGELAST 0.044011 0.001399 31.462 <2e-16 ***
## SDENICPROD1 YES:SEX2 FEMALE 0.082964 0.146543 0.566 0.571
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15230 on 18178 degrees of freedom
## Residual deviance: 14010 on 18174 degrees of freedom
## AIC: 14020
##
## Number of Fisher Scoring iterations: 5
data_point <- filter_df %>% group_by(AGELAST, SEX, SDENICPROD) %>% summarise(WithBronch = sum(DIABDX_M18 == "1 YES"), tot = n()) %>%
mutate(frequency = WithBronch/tot) ## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")
# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = SDENICPROD)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of DIABDX_M18 by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()filter_df <- small_df %>% select(HIBPDX, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$HIBPDX <- relevel(filter_df$HIBPDX, ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(HIBPDX ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = HIBPDX ~ SDENICPROD * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.023559 0.072194 -55.733 < 2e-16 ***
## SDENICPROD1 YES 0.410370 0.075027 5.470 4.51e-08 ***
## SEX2 FEMALE -0.177501 0.037231 -4.768 1.86e-06 ***
## AGELAST 0.066655 0.001154 57.764 < 2e-16 ***
## SDENICPROD1 YES:SEX2 FEMALE -0.178396 0.105726 -1.687 0.0915 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24447 on 18171 degrees of freedom
## Residual deviance: 19861 on 18167 degrees of freedom
## AIC: 19871
##
## Number of Fisher Scoring iterations: 4
data_point <- filter_df %>% group_by(AGELAST, SEX, SDENICPROD) %>% summarise(WithPress = sum(HIBPDX == "1 YES"), tot = n()) %>%
mutate(frequency = WithPress/tot) ## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")
# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = SDENICPROD)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of HIBPDX by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()filter_df <- small_df %>% select(CALUNG, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$CALUNG <- relevel(filter_df$CALUNG, ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(CALUNG ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = CALUNG ~ SDENICPROD * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.12639 0.85751 -7.144 9.04e-13 ***
## SDENICPROD1 YES 1.63074 0.45677 3.570 0.000357 ***
## SEX2 FEMALE 0.07510 0.26186 0.287 0.774257
## AGELAST 0.03470 0.01124 3.087 0.002019 **
## SDENICPROD1 YES:SEX2 FEMALE -0.41933 0.61786 -0.679 0.497337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 682.92 on 2536 degrees of freedom
## Residual deviance: 661.99 on 2532 degrees of freedom
## AIC: 671.99
##
## Number of Fisher Scoring iterations: 6
data_point <- filter_df %>% group_by(AGELAST, SEX, SDENICPROD) %>% summarise(WithPress = sum(CALUNG == "1 YES"), tot = n()) %>%
mutate(frequency = WithPress/tot) ## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")
# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = SDENICPROD)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of CALUNG by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()filter_df <- small_df %>% select(CHBRON31, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$CHBRON31 <- relevel(filter_df$CHBRON31, ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(CHBRON31 ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = CHBRON31 ~ SDENICPROD * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.965273 0.252848 -23.592 < 2e-16 ***
## SDENICPROD1 YES 0.774959 0.259697 2.984 0.002844 **
## SEX2 FEMALE 0.491805 0.144139 3.412 0.000645 ***
## AGELAST 0.024422 0.003629 6.729 1.71e-11 ***
## SDENICPROD1 YES:SEX2 FEMALE 0.167296 0.314788 0.531 0.595102
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2908.0 on 18138 degrees of freedom
## Residual deviance: 2829.6 on 18134 degrees of freedom
## AIC: 2839.6
##
## Number of Fisher Scoring iterations: 7
data_point <- filter_df %>% group_by(AGELAST, SEX, SDENICPROD) %>% summarise(WithPress = sum(CHBRON31 == "1 YES"), tot = n()) %>%
mutate(frequency = WithPress/tot) ## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")
# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = SDENICPROD)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of CHBRON31 by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()filter_df <- small_df %>% select(EMPHDX, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$EMPHDX <- relevel(filter_df$EMPHDX, ref = "2 NO")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- glm(EMPHDX ~ SDENICPROD * SEX + AGELAST, data = filter_df, family = "binomial")
summary(fit)##
## Call:
## glm(formula = EMPHDX ~ SDENICPROD * SEX + AGELAST, family = "binomial",
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.212236 0.286992 -28.615 < 2e-16 ***
## SDENICPROD1 YES 1.253981 0.193914 6.467 1e-10 ***
## SEX2 FEMALE -0.216396 0.114750 -1.886 0.05932 .
## AGELAST 0.069836 0.003905 17.883 < 2e-16 ***
## SDENICPROD1 YES:SEX2 FEMALE 0.696956 0.245796 2.836 0.00458 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3972.7 on 18178 degrees of freedom
## Residual deviance: 3497.9 on 18174 degrees of freedom
## AIC: 3507.9
##
## Number of Fisher Scoring iterations: 7
data_point <- filter_df %>% group_by(AGELAST, SEX, SDENICPROD) %>% summarise(WithPress = sum(EMPHDX == "1 YES"), tot = n()) %>%
mutate(frequency = WithPress/tot) ## `summarise()` has grouped output by 'AGELAST', 'SEX'. You can override using
## the `.groups` argument.
# Creating the prediction data frame
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
# Predicting values using the model
df_pred$Yval <- predict(fit, newdata = df_pred, type = "response")
# Plotting the results
ggplot(df_pred, aes(x = AGELAST, y = Yval, col = SDENICPROD)) +
geom_line() +
geom_point(data = data_point, aes(y = frequency)) +
facet_wrap(~SEX) +
labs(title = "Predicted Probability of EMPHDX by Age, Smoking Status, and Sex",
x = "Age (AGELAST)",
y = "Predicted Probability (Yval)") +
theme_minimal()To sum up all the results we can see above, even if the results are based on a different section of the population (because as we saw above cigarettes and electronic smoking devices are distributed in different ways), the results are pretty similar. With respect to cigarettes, the difference between smokers and non smokers are more accentuated in some cases, but most times are a little bit less present. That’s because the data in our possession about e-smokers are more distributed around younger ages, in which usually people tends to stay better regardless. Even with this specification, the variable is still very significant in most of our models, signalling that even if you smoke electronic cigarettes, not smoking at all tends to be increase the probabilities of staying better.
filter_df <-small_df %>% select(MNHLTH31, SDENICPROD, SEX, AGELAST) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$MNHLTH31 <- relevel(filter_df$MNHLTH31, ref = "1 EXCELLENT")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")filter_df$MNHLTH31 = ordered(filter_df$MNHLTH31)
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
fit <- vglm(MNHLTH31 ~ AGELAST,
family = cumulative(parallel = TRUE),
data = filter_df)
matplot(df_pred$AGELAST, predict(fit, df_pred,type = "response"), type = "l",
ylab = "Predicted Probability", xlab = "Age", lwd = 2)
title("Mental Health by Age")
grid()
legend("left",legend = sort(unique(filter_df$MNHLTH31)), col = 1:5, lty = 1:5, lwd = 2, cex = 0.6)null <- vglm(MNHLTH31 ~ 1,
family = cumulative(parallel = TRUE),
data = filter_df)
pchisq(deviance(null)-deviance(fit), df <- df.residual(null)-df.residual(fit), lower.tail=FALSE)## [1] 0.1164615
fit <- vglm(MNHLTH31 ~ SDENICPROD + SEX + AGELAST,
family = cumulative(parallel = TRUE),
data = filter_df)
summary(fit)##
## Call:
## vglm(formula = MNHLTH31 ~ SDENICPROD + SEX + AGELAST, family = cumulative(parallel = TRUE),
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.5917390 0.0467452 -12.659 < 2e-16 ***
## (Intercept):2 0.8014889 0.0469039 17.088 < 2e-16 ***
## (Intercept):3 2.5282903 0.0510390 49.536 < 2e-16 ***
## (Intercept):4 4.2967733 0.0692881 62.013 < 2e-16 ***
## SDENICPROD1 YES -0.5626211 0.0407733 -13.799 < 2e-16 ***
## SEX2 FEMALE -0.2710512 0.0270943 -10.004 < 2e-16 ***
## AGELAST -0.0033686 0.0007562 -4.455 8.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]),
## logitlink(P[Y<=3]), logitlink(P[Y<=4])
##
## Residual deviance: 49614.98 on 72677 degrees of freedom
##
## Log-likelihood: -24807.49 on 72677 degrees of freedom
##
## Number of Fisher scoring iterations: 4
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):4'
##
##
## Exponentiated coefficients:
## SDENICPROD1 YES SEX2 FEMALE AGELAST
## 0.5697138 0.7625775 0.9966371
data <- filter_df
formula <- MNHLTH31 ~ SDENICPROD + SEX + AGELAST
fit_model <- function(data, indices) {
boot_data <- data[indices, ]
model <- vglm(formula,
family = cumulative(parallel = TRUE),
data = boot_data)
return(coef(model))
}
set.seed(123)
bootstrap_results <- boot(data = data, statistic = fit_model, R = 100)
bootstrap_results##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = fit_model, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -0.591738980 1.433870e-03 0.0452970293
## t2* 0.801488911 -1.015003e-03 0.0474743409
## t3* 2.528290262 -8.625583e-04 0.0548296667
## t4* 4.296773306 1.272134e-03 0.0706058286
## t5* -0.562621062 5.643856e-03 0.0443905054
## t6* -0.271051163 -2.033241e-03 0.0240889656
## t7* -0.003368567 2.181939e-05 0.0007284386
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstrap_results, type = "basic")
##
## Intervals :
## Level Basic
## 95% (-0.6856, -0.5046 )
## Calculations and Intervals on Original Scale
## Some basic intervals may be unstable
smk_df = filter_df %>% filter(SDENICPROD == "1 YES")
n_smk_df = filter_df %>% filter(SDENICPROD == "2 NO")# For smokers
fit <- vglm(MNHLTH31 ~ AGELAST,
family = cumulative(parallel = TRUE),
data = smk_df)
df_pred = data.frame(AGELAST = seq(min(smk_df$AGELAST), max(smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred)
cum_probabilities = t(apply(preds,1,plogis))
par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2,
ylab = "Cumulative probabilites", xlab = "AGELAST")
grid()
legend("bottomleft",
legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.5))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
ylab = "Category probabilites", xlab = "AGELAST")
grid()
legend("topright",
legend = sprintf("Pr[Y = %g]", 1:5),
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.15))
mtext("Perceived Mental Health Status for E-Smokers", outer = TRUE, cex = 1.5, line = -3.5)# For non smokers
fit <- vglm(MNHLTH31 ~ AGELAST,
family = cumulative(parallel = TRUE),
data = n_smk_df)
df_pred = data.frame(AGELAST = seq(min(n_smk_df$AGELAST), max(n_smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred) # check that they sum to 1
cum_probabilities = t(apply(preds,1,plogis))
par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2,
ylab = "Cumulative probabilites", xlab = "AGELAST")
grid()
legend("bottomleft",
legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.14))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
ylab = "Category probabilites", xlab = "AGELAST")
grid()
legend("topright",
legend = sprintf("Pr[Y = %g]", 1:5),
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.37))
mtext("Perceived Mental Health Status for NOT E-Smokers", outer = TRUE, cex = 1.5, line = -3)filter_df <-small_df %>% select(ADGENH42, SDENICPROD, SEX, AGELAST, AGEGROUP) %>% filter(rowSums(sapply(., function(x) x %in% values_to_remove)) == 0)
# filter_df <- filter_df %>% filter(AGEGROUP == "65+")
filter_df$ADGENH42 <- relevel(filter_df$ADGENH42, ref = "1 EXCELLENT")
filter_df$SDENICPROD <- relevel(filter_df$SDENICPROD, ref = "2 NO")
filter_df$SEX <- relevel(filter_df$SEX, ref = "1 MALE")
fit <- vglm(ADGENH42 ~ AGELAST , data = filter_df, family = multinomial)
# summary(fit)
df_pred <- expand.grid(SEX = c("1 MALE", "2 FEMALE"), SDENICPROD = c("1 YES", "2 NO"), AGELAST = seq(18, 80))
matplot(df_pred$AGELAST, predict(fit, df_pred,type = "response"), type = "l",
ylab = "Predicted Probability", xlab = "Age", lwd = 2)
title("General Health by Age")
grid()
legend("bottomleft",legend = unique(filter_df$ADGENH42), col = 1:5, lty = 1:5, lwd = 2, cex = 0.6, inset = c(0, 0.25))## 2.5 % 97.5 %
## (Intercept):1 4.06916679 4.938852518
## (Intercept):2 3.95548228 4.808555597
## (Intercept):3 3.17255275 4.029734340
## (Intercept):4 1.49647559 2.398045613
## AGELAST:1 -0.05490893 -0.040851504
## AGELAST:2 -0.03600902 -0.022451069
## AGELAST:3 -0.02340982 -0.009822304
## AGELAST:4 -0.01105416 0.003210803
null = vglm(ADGENH42 ~ 1, family = multinomial, data = filter_df)
print(pseudo_R2 <- 1 - deviance(fit) / deviance(null))## [1] 0.02057306
## Analysis of Deviance Table (Type II tests)
##
## Model: 'multinomial', 'VGAMcategorical'
##
## Link: 'multilogitlink'
##
## Response: ADGENH42
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## AGELAST 4 791.25 55660 38461 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
filter_df$ADGENH42 = ordered(filter_df$ADGENH42)
fit <- vglm(ADGENH42 ~ SDENICPROD + SEX + AGELAST,
family = cumulative(parallel = TRUE),
data = filter_df)
summary(fit)##
## Call:
## vglm(formula = ADGENH42 ~ SDENICPROD + SEX + AGELAST, family = cumulative(parallel = TRUE),
## data = filter_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.1190151 0.0550320 -2.163 0.0306 *
## (Intercept):2 1.6909065 0.0565412 29.906 < 2e-16 ***
## (Intercept):3 3.3801217 0.0618049 54.690 < 2e-16 ***
## (Intercept):4 5.4288257 0.0807611 67.221 < 2e-16 ***
## SDENICPROD1 YES -0.5795052 0.0477565 -12.135 < 2e-16 ***
## SEX2 FEMALE -0.1732885 0.0311416 -5.565 2.63e-08 ***
## AGELAST -0.0265767 0.0009036 -29.410 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]),
## logitlink(P[Y<=3]), logitlink(P[Y<=4])
##
## Residual deviance: 37518.88 on 55657 degrees of freedom
##
## Log-likelihood: -18759.44 on 55657 degrees of freedom
##
## Number of Fisher scoring iterations: 4
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):4'
##
##
## Exponentiated coefficients:
## SDENICPROD1 YES SEX2 FEMALE AGELAST
## 0.5601755 0.8408950 0.9737734
data <- filter_df
formula <- ADGENH42 ~ SDENICPROD + SEX + AGELAST
fit_model <- function(data, indices) {
boot_data <- data[indices, ]
model <- vglm(formula,
family = cumulative(parallel = TRUE),
data = boot_data)
return(coef(model))
}
set.seed(123)
bootstrap_results <- boot(data = data, statistic = fit_model, R = 100)
bootstrap_results##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = fit_model, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -0.11901505 0.0013523776 0.0524053679
## t2* 1.69090651 -0.0001473078 0.0571454857
## t3* 3.38012166 -0.0009748008 0.0661845763
## t4* 5.42882571 0.0039747501 0.0782417896
## t5* -0.57950520 0.0049793864 0.0495783138
## t6* -0.17328846 0.0004614767 0.0283499062
## t7* -0.02657666 -0.0000689214 0.0008788527
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 100 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstrap_results, type = "basic")
##
## Intervals :
## Level Basic
## 95% (-0.2264, -0.0138 )
## Calculations and Intervals on Original Scale
## Some basic intervals may be unstable
smk_df = filter_df %>% filter(SDENICPROD == "1 YES")
n_smk_df = filter_df %>% filter(SDENICPROD == "2 NO")# For smokers
fit <- vglm(ADGENH42 ~ AGELAST,
family = cumulative(parallel = TRUE),
data = smk_df)
df_pred = data.frame(AGELAST = seq(min(smk_df$AGELAST), max(smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred)
cum_probabilities = t(apply(preds,1,plogis))
par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2,
ylab = "Cumulative probabilites", xlab = "AGELAST")
grid()
legend("bottomleft",
legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.25))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
ylab = "Category probabilites", xlab = "AGELAST")
grid()
legend("topright",
legend = sprintf("Pr[Y = %g]", 1:5),
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.08))
mtext("General Health Status for E-Smokers", outer = TRUE, cex = 1.5, line = -3.5)# For not smokers
fit <- vglm(ADGENH42 ~ AGELAST,
family = cumulative(parallel = TRUE),
data = n_smk_df)
df_pred = data.frame(AGELAST = seq(min(smk_df$AGELAST), max(smk_df$AGELAST), by = 1))
preds = predict(fit, df_pred)
cum_probabilities = t(apply(preds,1,plogis))
par(mfrow = c(1,2))
matplot(df_pred$AGELAST,cum_probabilities, lty = 1, type = "l", lwd = 2,
ylab = "Cumulative probabilites", xlab = "AGELAST")+
grid()## integer(0)
legend("bottomleft",
legend = gsub("link", "", colnames(cum_probabilities)), # remove link from labels
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.3))
category_probabilities = t(apply(cbind(0,cum_probabilities,1),1,diff))
preds = predict(fit, df_pred, type = "response") # check that they are equivalent
matplot(df_pred$AGELAST, preds, lty = 1, type = "l", lwd = 2,
ylab = "Category probabilites", xlab = "AGELAST")
grid()
legend("topright",
legend = sprintf("Pr[Y = %g]", 1:5),
col = c(1,2,3,4,5), lty = 1, cex = 0.6, xpd = TRUE,
inset = c(0, 0.08))
mtext("General Health Status for NOT E-Smokers", outer = TRUE, cex = 1.5, line = -3.5)Also in this case the results for Perceived Mental Health and General Health are quite similar to those previously discussed. A difference we can point out is a general improvement in perceived mental health for E-Smokers (compared to cigarette smokers) and a general worsening in general health for E-Smokers. One interpretation of this could be that some people are switching to E-smoking because of poor health, and what we are seeing is due to external factors or previous habits. Unfortunately, we do not have access to this type of data in this dataset, so we will limit our discussion to what we are observing.
Our study investigates the negative health effects associated with smoking, focusing on both physical and mental health outcomes. Using the 2021 Medical Expenditure Panel Survey Household Component data, which includes responses from 15,000 households in the United States, we compared the health impacts of smoking with those of using electronic nicotine products (e-nic products). Our analysis reveals that overall general and mental health is better among non-smokers compared to both smokers and e-nic product users.
However, the data set’s limited size constrains the robustness of our findings. Larger and more comprehensive data would likely provide more definitive insights into these health relationships. Additionally, our understanding of the health effects associated with e-nic products remains incomplete due to the relatively new nature of these products and the insufficient data available in our current dataset. Future research should aim to gather more extensive data on e-nic product usage to better assess its impact on mental and general health.
Despite these limitations, our findings contribute to the growing body of evidence highlighting the adverse health effects of smoking and underscore the need for continued public health efforts to reduce smoking rates and further investigate the health implications of e-nic products.